Not the way you think. You are seeing the mask and trying to understand it, while ignoring the shoggoth underneath.
The topic has gotten a lot of discussion, but from the relevant context of the shoggoth, not the irrelevant point of the mask. Every post asking how we know when an AI is telling the truth versus repeating what is in its training data, all the talk of p-zombies, and so on: that is all directly struggling with your question.
Edit: to be clear, I am not saying that the fact AIs make mistakes shows they are inhuman. I am saying: look at the mistake and ask yourself what the fact that the AI made this specific mistake, and not a different one, tells you about why the AI thought the mistake was correct. /edit
Here is an example. AlphaGo beat the best players in the world at Go. They described its thinking as completely alien, and they were emotionally distraught at how badly they lost. Years later, amateur players were able to beat top Go engines in the same lineage (most famously KataGo) repeatedly and reliably, using exploits discovered by adversarial search. It turned out the AI did not know it was playing a game on a board whose conditions persist from turn to turn; it did not, and does not, understand that there are regions of the board controlled by the stones around them. It can pick the best next move without that understanding. Patched versions win again, but they still do not understand the context. There is no way to confirm the system knows it is playing a game.
Going directly at images: why are you so sure DALL-E knows what an image is or what it is doing? Why do you think it knows there is a person looking at its output, as opposed to an electric field pulsing in a non-random way? Does it understand that we see red but not infrared? Is it adding details only tetrachromats can see? No one has the answers to these questions. No one has a technique with a plausible mechanism for getting the answers.
Your image prompt might be creating a pattern in binary, or in hexadecimal color codes, but it is probably a bizarre set of tokens that fit like Cthulhu-esque jigsaw pieces into a mosaic, using more relationships than human minds can comprehend. I saw a claim that GPT-4 broke language out into a table of relationships with 36,000+ dimensions. It ain't using Hooked on Phonics, but it can certainly trick you into believing it does. That tricking you is the mask. The 36,000-dimensional map of language is part of the shoggoth, not the whole thing.
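To give a rough sense of what a "table of relationships" means in practice, here is a toy sketch. Nothing in it comes from a real model: the vocabulary, the dimension, and the vectors are made up purely to show the idea that tokens become points in a high-dimensional space and "relatedness" is geometry.

```python
# Toy sketch of token embeddings: each token is a point in a high-dimensional
# space, and relatedness is just geometric closeness. Real models use
# thousands of dimensions and learn the geometry from data; everything here
# is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"red": 0, "crimson": 1, "infrared": 2, "banana": 3}
dim = 512                                   # toy size; real models use far more
E = rng.normal(size=(len(vocab), dim))      # one vector per token
E[1] = E[0] + 0.1 * rng.normal(size=dim)    # pretend "crimson" lands near "red"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(E[vocab["red"]], E[vocab["crimson"]]))  # high: related tokens
print(cosine(E[vocab["red"]], E[vocab["banana"]]))   # near 0: unrelated
```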
To make the mask slip in images, give it tasks that rely on relationships, not facades. For example, and I apologize if you are easily offended, do a search on hyper muscle growth. You will get porn, sorry. But in it you will find images with impossible mass and realistic internal anatomy; the artists understand skeletons and where muscle attaches to limbs. Drop some of the most extreme art into Sora or Nano Banana or Grok and animate it. The skeletons lose all coherence. The facade, the skin and limbs, moves, but what is going on inside cannot be happening. Skeletons do not do what the image generators will try, because the generators cannot see the invisible, and skeletons are invisible. For a normally proportioned human that hardly matters; for an impossible proportion it matters a lot. 3D artists draw skeleton wireframes and then layer polygons on top so the range of motion fits what is possible and correct. AI copies the training data and extrapolates. Impossible shapes cause the mask to slip: it simply does not know what a body is; it thinks we are blobs that move and have precise shapes.
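If it helps, here is a minimal sketch of the kind of constraint a rig gives a 3D artist and that a pixel-space generator never sees. Everything in it (the bone lengths, the joint limits, the fk_arm helper) is made up for illustration; real rigs are 3D and far more elaborate.

```python
# 2D forward kinematics with joint limits: bone lengths are fixed and joint
# angles are clamped, so whatever the skin looks like, the limb can only reach
# poses the skeleton allows. Image and video generators have no such structure;
# they only extrapolate the visible surface.
import math

def fk_arm(shoulder_xy, bones, angles_deg):
    """bones: [(length, (min_deg, max_deg)), ...]; returns joint positions."""
    x, y = shoulder_xy
    heading = 0.0
    joints = [(x, y)]
    for (length, (lo, hi)), a in zip(bones, angles_deg):
        a = max(lo, min(hi, a))              # clamp to the anatomical range
        heading += math.radians(a)
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        joints.append((x, y))
    return joints

arm = [(0.30, (-90, 90)),   # upper arm relative to the shoulder
       (0.28, (0, 150))]    # elbow: cannot bend backwards
print(fk_arm((0.0, 0.0), arm, [45, -60]))   # the -60 gets clamped to 0
```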
Monsters are another one. Try a displacer beast: a cat with six legs, four shaped like traditional feline front legs and two like rear legs, plus two tentacles coming off its shoulders that are like the arms (not tentacles) of a squid, with the barbed, diamond-shaped pads. The difference between tentacle motion and leg motion is unknown to the AI, because it depends on what is unseen, the skeleton underneath. Again, it thinks a monster is a blob that moves.
Getting to your architecture example, you see this in window and door placement. There is no understanding of the relationships or the 3D nature of the space. Instead it knows that walls have doors and windows, that doors are more likely down low, and that windows are more likely up high. So it adds them. But it does not understand space or function.
When you see people talk about AI slop articles using the "it's not X, it's Y" pattern, or triplets, or em dashes, this is the same topic. This is how the AI knows what it knows: why it thinks that is good writing, yet uses it in ways humans don't, even though it got the pattern out of human-made training data. Same topic, different application.
There is no understanding of the relationships or the 3D nature of the space.
I don't think you can claim that. Research has repeatedly shown that image generators like Stable Diffusion carry strong representations of depth and geometry, such that performant depth estimation models can be built out of them with minimal retraining. This continues to be true for video generation models.
Early work on the subject: https://arxiv.org/pdf/2409.09144
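For flavor, this is roughly what the probing setup in that line of work looks like. The tensors below are random stand-ins for frozen diffusion features and ground-truth depth maps (the real experiments extract activations from a pretrained model); only a tiny readout head is trained.

```python
# Probe sketch: freeze the generator, take intermediate activations, and train
# only a 1x1-conv readout to predict per-pixel depth. If a head this small can
# decode depth, the geometric information was already in the frozen features.
# The tensors here are random stand-ins, not real activations or depth maps.
import torch
import torch.nn as nn

B, C, H, W = 8, 320, 32, 32
features = torch.randn(B, C, H, W)       # stand-in for frozen U-Net features
depth_gt = torch.rand(B, 1, H, W)        # stand-in for ground-truth depth

probe = nn.Conv2d(C, 1, kernel_size=1)   # per-pixel linear readout
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(100):
    loss = nn.functional.mse_loss(probe(features), depth_gt)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"probe loss after training: {loss.item():.4f}")
```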
So do you think this is part of how it generates images—i.e. having used depth estimation and much else to infer 3D objects/scenes and the wider workings of the 3D world from its training photographs, it turns a new description into an internal representation of a 3D scene and then renders it to photorealistic 2D using inter alia a kind of reversal of its 2D->3D inference algorithm???
Which seems miraculous to me—not so much that this is possible, but that a neural network figured this all out by itself rather than the complex algorithms required being very elaborately hand-coded.
I would have assumed that at best a neural network would infer a big pile of kludges that would produce poor results, like a human trying to Photoshop a bunch of different photographs together.
Yes, that’s what I’m implying. We don’t have super concrete evidence for it, but a large contingent of researchers (including myself) believes things like that are happening.
To understand why neural networks might do this, you can view them as a sort of Solomonoff induction. The neural network is a "program" (analogous to a Python program) that has to model its data as well as possible, but the model is only so big, which means the program is limited in length. If you think about how you might write a program to generate images, it would be much more code-efficient to write abstractions and rules for things like geometry than to enumerate every possibility. The training/optimization algorithm figures that out too.
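A crude way to see the pressure toward rules rather than memorization is to use off-the-shelf compression as a stand-in for description length. The square-rendering rule below is made up purely for illustration:

```python
# Data produced by a simple geometric rule compresses far better than the same
# amount of arbitrary data. A capacity-limited model faces the same tradeoff:
# learning the rule is much cheaper than memorizing every output.
import zlib
import numpy as np

rng = np.random.default_rng(0)

def render_square(x, y, size=4, hw=32):
    img = np.zeros((hw, hw), dtype=np.uint8)
    img[y:y + size, x:x + size] = 255
    return img

ruled = np.stack([render_square(rng.integers(0, 28), rng.integers(0, 28))
                  for _ in range(200)]).tobytes()
noise = rng.integers(0, 256, size=len(ruled), dtype=np.uint8).tobytes()

print("rule-governed images:", len(zlib.compress(ruled)), "compressed bytes")
print("arbitrary bytes:     ", len(zlib.compress(noise)), "compressed bytes")
```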
You might also be interested in emergent world representations [1] (models simulate the underlying processes, or "world", that generate their data in order to predict it) and the platonic representation hypothesis [2] (different models trained on separate modalities like text and images form similar representations, meaning that an image model will represent a picture of a dog in a similar way to how a text model represents the word "dog").
1: https://arxiv.org/abs/2210.13382
2: https://arxiv.org/abs/2405.07987
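To give a concrete flavor of how claims like [2] get tested: embed the same inputs with two models and measure how similar the resulting geometries are, for example with linear CKA. The embeddings below are random stand-ins, with the "second model" constructed as a rotated copy just to show what a score of 1 versus roughly 0 looks like.

```python
# Linear CKA between two embedding matrices of the same n inputs: 1.0 means the
# same geometry up to rotation/scaling, values near 0 mean unrelated
# representations. All matrices here are synthetic stand-ins.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
n, d = 2000, 64
emb_a = rng.normal(size=(n, d))               # stand-in: model A's embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random rotation
emb_b = 3.0 * emb_a @ Q                       # same geometry, new coordinates
emb_c = rng.normal(size=(n, d))               # stand-in: an unrelated model

print(round(linear_cka(emb_a, emb_b), 3))     # 1.0
print(round(linear_cka(emb_a, emb_c), 3))     # close to 0
```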
why are you so sure DALL-E knows what an image is or what it is doing? Why do you think it knows there is a person looking at its output, as opposed to an electric field pulsing in a non-random way?
I don’t think it does (and didn’t say this!)
My question is how it manages to produce, almost all the time, such convincing 3D images without knowing about the 3D world and everything else normally required to create a realistic 3D image. You can't do it just by fitting together loosely suitable existing photographs (from its training data) and tweaking the results.
I don’t deny that you can catch it out by asking for weird things very different from its training data (though it often makes a good attempt). However that doesn’t explain how it does so well at creating images which are broadly within the range of its training data, but different enough from the photographs it’s seen that a skilled human with Photoshop couldn’t do what it does.