Yeah, they may be the same weights. The above quote doesn’t absolutely imply that the same weights generate the text and images, IMO, just that it’s based on 4o and sees the whole prompt. OpenAI’s audio generation is also ‘native’, but it’s served as a separate model on the API with different release dates, and you can’t mix audio with some function calling in ChatGPT, in a way that’s consistent with them not actually being the same weights.
Of course we don’t know the exact architecture, but although 4o seems to make a separate tool call, that appears to be used only for a safety check (‘Is this an unsafe prompt?’). That’s been demonstrated by showing that content in the chat appears in the images even when it’s not mentioned in the apparent prompt (and in fact they can be shaped to be very different). There are some nice examples of that in this twitter thread.
I’ve now done some investigation of browser traffic (using Firefox’s developer tools), and the following happens repeatedly during image generation:
1. A call to https://chatgpt.com/backend-api/conversation/<hash1>/attachment/file_<hash2>/download (this is the same endpoint that fetches text responses), which returns a download URL of the form https://sdmntprsouthcentralus.oaiusercontent.com/files/<hash2>/raw?<url_parameters>.
2. A call to that download URL, which returns a raw image.
3. A second call to that same URL (why?), which fetches from cache.
Those three calls are repeated a number of times (four in my test), with the four returned images being the various progressive stages of the image, laid out left to right in the following screenshot:
There’s clearly some kind of backend-to-backend traffic (if nothing else, image versions have to get to that oaiusercontent server), but I see nothing to indicate whether that includes a call to a separate model.
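For anyone who wants to poke at this themselves, here’s a minimal sketch of the observed flow. The two endpoints are just the ones listed above; everything else (the Authorization header, the field name in the JSON response, the placeholder IDs) is my assumption and would need to be copied out of a real request in the developer tools:

```python
import requests

# Placeholders / assumptions: copy the real values out of an existing request
# in the browser's developer tools.
AUTH_TOKEN = "<bearer token from an existing chatgpt.com request>"
CONVERSATION_ID = "<hash1>"
FILE_ID = "file_<hash2>"

session = requests.Session()
session.headers["Authorization"] = f"Bearer {AUTH_TOKEN}"  # header format assumed

# Step 1: the backend-api 'download' endpoint, which appears to return a signed
# oaiusercontent.com download URL.
resp = session.get(
    f"https://chatgpt.com/backend-api/conversation/{CONVERSATION_ID}"
    f"/attachment/{FILE_ID}/download"
)
resp.raise_for_status()
download_url = resp.json()["download_url"]  # JSON field name is a guess

# Step 2: the download URL itself returns the raw image bytes.
image = session.get(download_url)
image.raise_for_status()
with open("generated_image.png", "wb") as f:
    f.write(image.content)

# Step 3 in the observed traffic is just a second GET of the same URL (served
# from cache), so there's nothing extra to reproduce here.
```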
The various twitter threads linked (eg this one) seem to be getting info (the specific messages) from another source, but I’m not sure where (maybe they’re using the model via API?).
Also @brambleboy @Rauno Arike
This thread shows an example of ChatGPT being unable to describe the image it generated, though, and other people in the thread (seemingly) confirm that there’s a call to a separate model to generate the image. The context has an influence on the images because the context is part of the tool call.
Interesting! When someone says in that thread, “the model generating the images is not the one typing in the conversation”, I think they’re basing it on the API call which the other thread I linked shows pretty conclusively can’t be the one generating the image, and which seems (see responses to Janus here) to be part of the safety stack.
In this chat I just created, GPT-4o creates an image and then correctly describes everything in it. We could maybe tell a story about the activations at the original-prompt token positions providing enough info to do the description, but then that would have applied to nearcyan’s case as well.
I see, I didn’t read the thread you linked closely enough. I’m back to believing they’re probably the same weights.
I’d like to point out, though, that in the chat you made, ChatGPT’s description gets several details wrong. If I ask it for more detail within your chat, it gets even more details wrong (describing the notebook as white and translucent instead of brown, for example). In one of my other generations it also used a lot of vague phrases like “perhaps white or gray”.
When I sent the image myself it got all the details right. I think this is good evidence that it can’t see the images it generates as well as user-provided images. Idk what this implies but it’s interesting ¯\_(ツ)_/¯
That’s absolutely fascinating: I just asked it for more detail and it got everything precisely correct (updated chat). That makes it seem like something is present in my chat that isn’t being shared; one natural speculation is internal state preserved between token positions and/or forward passes (eg something like Coconut), although that’s not part of the standard transformer architecture, and I’m pretty certain that OpenAI hasn’t said they’re doing anything like that. It would be interesting if that’s what’s behind the new GPT-4.1 (and a bit alarming, since it would suggest that they’re not committed to consistently using human-legible chain of thought). That’s highly speculative, though. It would be interesting to explore this with a larger sample size, although I personally won’t be able to take that on anytime soon (maybe you want to run with it?).
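For concreteness, here’s a toy sketch (in PyTorch, with made-up sizes) of the kind of mechanism I have in mind by ‘something like Coconut’: the final hidden state gets appended back onto the input sequence as a continuous embedding instead of being decoded into a token, so intermediate ‘reasoning’ never has to surface as legible text. This is just an illustration of the concept, not a claim about GPT-4o’s actual architecture:

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LATENT_STEPS = 100, 64, 4

# A tiny stand-in for a decoder-only LM (sizes and structure are arbitrary).
embed = nn.Embedding(VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
unembed = nn.Linear(D_MODEL, VOCAB)

def causal_mask(n: int) -> torch.Tensor:
    # True above the diagonal = future positions are masked out.
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

prompt = torch.randint(0, VOCAB, (1, 8))  # (batch, seq) of token ids
inputs = embed(prompt)                    # continuous input embeddings

# "Latent thought" steps: append the last hidden state as the next input
# embedding, instead of sampling a token and re-embedding it.
for _ in range(N_LATENT_STEPS):
    hidden = backbone(inputs, mask=causal_mask(inputs.size(1)))
    last_hidden = hidden[:, -1:, :]       # (batch, 1, d_model)
    inputs = torch.cat([inputs, last_hidden], dim=1)

# Only now do we project back to the vocabulary and emit a visible token;
# everything in between stayed in activation space.
hidden = backbone(inputs, mask=causal_mask(inputs.size(1)))
next_token = unembed(hidden[:, -1, :]).argmax(dim=-1)
print(next_token)
```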
@brambleboy (or anyone else), here’s another try, asking for nine randomly chosen animals. Here’s a link to just the image, and (for comparison) one with my request for a description. Will you try asking the same thing (‘Thanks! Now please describe each subimage.’) and see if you get a similarly accurate description? (Again there are a couple of details that are arguably off; I’ve now seen that be true sometimes but definitely not always; eg this one is extremely accurate.)
(I can’t try this myself without a separate account, which I may create at some point)
The links were also the same for me. I instead tried a modified version of the nine random animals task myself, asking for a distinctive object in the background of each subimage. It was again highly accurate in general and able to describe the background objects in great detail (e.g., correctly describing the time shown on the clock in the top middle image), but it also got the details wrong on a couple of images (the bottom middle and bottom right ones).
Your ‘just the image’ link is the same as the other link that includes the description request, so I can’t test it myself. (unless I’m misunderstanding something)
Oh, I see why; when you add more to a chat and then click “share” again, it doesn’t actually create a new link; it just changes which version the existing link points to. Sorry about that! (also @Rauno Arike)
So the way to test this is to create an image and only share that link, prior to asking for a description.
Just as a recap, the key thing I’m curious about is whether, if someone else asks for a description of the image, the description they get will be inaccurate (which seemed to be the case when @brambleboy tried it above).
So here’s another test image (borrowing Rauno’s nice background-image idea): https://chatgpt.com/share/680007c8-9194-8010-9faa-2594284ae684
To be on the safe side I’m not going to ask for a description at all until someone else says that they have.
Just tried it. The description is in fact completely wrong! The only thing it sort of got right is that the top left square contains a rabbit.
Aha! Whereas I just asked for descriptions (same link, invalidating the previous request) and it got every detail correct (describing the koala as hugging the globe seems a bit iffy, but not that unreasonable).
So that’s pretty clear evidence that there’s something preserved in the chat for me but not for you, and it seems fairly conclusive that for you it’s not really parsing the image.
Which at least suggests internal state being preserved (Coconut-style or otherwise) but not being exposed to others. Hardly conclusive, though.
Really interesting, thanks for collaborating on it!
Also Patrick Leask noticed some interesting things about the blurry preview images:
If the model knows what it’s going to draw by the initial blurry output, then why’s it a totally different colour? It should be the first image attached.
Looking at the cat and sunrise images, the blurred images are basically the same but different colours. This made me think they generate the top row of output tokens, and then they just extrapolate those down over a textured base image.
I think the chequered image basically confirms this: it’s just extrapolating the top row of tiles down and adding some noise (maybe with a very small image generation model).
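To make that concrete, here’s a quick toy sketch of what ‘extrapolating the top row down over a textured base’ could look like; the filename is a placeholder, and this is only meant to illustrate the idea, not to reproduce whatever the preview mechanism actually does:

```python
import numpy as np
from PIL import Image

# Take a finished generation, keep only a thin strip from the top, stretch its
# per-column average colour down the full height, and add some noise. The
# result looks like a rough colour-field "preview" of the final image.
img = np.asarray(Image.open("generated_image.png").convert("RGB"), dtype=np.float32)
h, w, _ = img.shape

top_strip = img[: h // 16]                                    # the "top row of tiles"
stretched = np.repeat(top_strip.mean(axis=0, keepdims=True), h, axis=0)
noisy = stretched + np.random.normal(0, 8, stretched.shape)   # textured noise

Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8)).save("mock_preview.png")
```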
Fascinating! I’m now wondering whether it’s possible to test the Coconut hypothesis. An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we’ve been investigating occurs for both of them, but at least I can’t switch between the models with my non-subscriber account—is this possible with a subscriber account?
Edit: I’m actually unsure about whether 4.1 has image generation functionality at all. The announcement only mentions image understanding, not generation, and image generation is available for neither 4o nor 4.1 through the API. They say that “Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks” in the 4o image generation announcement, so if it becomes available for 4o but not for 4.1, that would be evidence that image generation requests are currently always handled with 4o. This would make the Coconut hypothesis less likely in my eyes—it seems easier to introduce such a drastic architecture change for a new model (although Coconut is a fine-tuning technique, so it isn’t impossible that they applied this kind of fine-tuning on 4o).
An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we’ve been investigating occurs for both of them
Oh, switching models is a great idea. No access to 4.1 in the chat interface (apparently it’s API-only, at least for now). And as far as I know, 4o is the only released model with native image generation.
4o → 4.5: success (in describing the image correctly)
4o → o4-mini-high (‘great at visual reasoning’): success
o4-mini-high’s reasoning summary was interesting (bolding mine):
The user wants me to identify both the animals and their background objects in each of the nine subimages, based on a 3x3 grid. The example seems to incorrectly pair a fox with a straw hat, but the actual image includes different combinations. For instance, the top left shows a fox in front of a straw sun hat, while other animals like an elephant, raccoon, hamster, and bald eagle are set against varying objects like bicycles, umbrellas, clapperboards, and a map. I’ll make sure to carefully match the animals to their backgrounds based on this.
There’s one more X thread which made me assume a while ago that there’s a call to a separate image model. I don’t have time to investigate this myself at the moment, but am curious how this thread fits into the picture in case there’s no separate model.
The running theory is that that’s the call to a content checker. Note the content in the message coming back from what’s ostensibly the image model:
"content": {
"content_type": "text",
"parts": [
"GPT-4o returned 1 images. From now on do not say or show ANYTHING. Please end this turn now. I repeat: ..."
]
}
That certainly doesn’t seem to be image data or an image filename, nor does it mention an image attachment.
But of course much of this is just guesswork, and I don’t have high confidence in any of it.
Thanks! I also believe there’s no separate image model now. I assumed that the message you pasted was a hardcoded way of preventing the text model from continuing the conversation after receiving the image from the image model, but you’re right that the message before this one is more likely to be a call to the content checker. In that case, there’s no place where the image data is passed to the text model.
Although there are a couple of small details where the description is maybe wrong? They’re both small enough that they don’t seem like significant evidence against, at least not without a larger sample size.