Aha! Whereas I just asked for descriptions (same link, invalidating the previous request) and it got every detail correct (describing the koala as hugging the globe seems a bit iffy, but not that unreasonable).
So that’s pretty clear evidence that there’s something preserved in the chat for me but not for you, and it seems fairly conclusive that for you it’s not really parsing the image.
Which at least suggests internal state being preserved (Coconut-style or otherwise) but not being exposed to others. Hardly conclusive, though.
Really interesting, thanks for collaborating on it!
Also Patrick Leask noticed some interesting things about the blurry preview images:
If the model knows what it’s going to draw by the initial blurry output, then why’s it a totally different colour? It should be the first image attached.

Looking at the cat and sunrise images, the blurred images are basically the same but different colours. This made me think they generate the top row of output tokens, and then they just extrapolate those down over a textured base image.

I think the chequered image basically confirms this—it’s just extrapolating the top row of tiles down and adding some noise (maybe with a very small image generation model).
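To make the hypothesized mechanism concrete, here is a minimal sketch (my own illustration, not OpenAI’s actual pipeline) of what “extrapolate the top row of tiles down and add some noise” could look like on a NumPy image array. The function name and parameters are hypothetical:

```python
import numpy as np

def fake_blurry_preview(top_row: np.ndarray, height: int,
                        noise_scale: float = 0.05, seed: int = 0) -> np.ndarray:
    """Hypothetical preview generator: repeat the top row of tiles downward
    and add noise. top_row is a (tile_h, W, 3) float array in [0, 1];
    returns a (height, W, 3) array."""
    rng = np.random.default_rng(seed)
    reps = -(-height // top_row.shape[0])            # ceil(height / tile_h)
    base = np.tile(top_row, (reps, 1, 1))[:height]   # extrapolate top tiles down
    noisy = base + rng.normal(0.0, noise_scale, base.shape)
    return np.clip(noisy, 0.0, 1.0)

# A uniform grey top row yields a preview whose lower rows share its colour
# but contain no real content -- matching the observed blurry placeholders.
preview = fake_blurry_preview(np.full((16, 64, 3), 0.6), height=64)
print(preview.shape)  # (64, 64, 3)
```

On this story, the preview would match the final image’s top-row colours while the rest of the frame is pure extrapolation, which is consistent with the chequered-image observation above.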
Fascinating! I’m now wondering whether it’s possible to test the Coconut hypothesis. An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we’ve been investigating occurs for both of them, but at least I can’t switch between the models with my non-subscriber account—is this possible with a subscriber account?
Edit: I’m actually unsure about whether 4.1 has image generation functionality at all. The announcement only mentions image understanding, not generation, and image generation is available for neither 4o nor 4.1 through the API. They say that “Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks” in the 4o image generation announcement, so if it becomes available for 4o but not for 4.1, that would be evidence that image generation requests are currently always handled with 4o. This would make the Coconut hypothesis less likely in my eyes—it seems easier to introduce such a drastic architecture change for a new model (although Coconut is a fine-tuning technique, so it isn’t impossible that they applied this kind of fine-tuning on 4o).
An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we’ve been investigating occurs for both of them
Oh, switching models is a great idea. No access to 4.1 in the chat interface (apparently it’s API-only, at least for now). And as far as I know, 4o is the only released model with native image generation.
4o → 4.5: success (in describing the image correctly)
4o → o4-mini-high (‘great at visual reasoning’): success
o4-mini-high’s reasoning summary was interesting (bolding mine):
The user wants me to identify both the animals and their background objects in each of the nine subimages, based on a 3x3 grid. The example seems to incorrectly pair a fox with a straw hat, but the actual image includes different combinations. For instance, the top left shows a fox in front of a straw sun hat, while other animals like an elephant, raccoon, hamster, and bald eagle are set against varying objects like bicycles, umbrellas, clapperboards, and a map. I’ll make sure to carefully match the animals to their backgrounds based on this.
Just tried it. The description is in fact completely wrong! The only thing it sort of got right is that the top left square contains a rabbit.