There is no understanding of the relationship or the 3D nature of the space.
I don’t think you can claim that. Research has repeatedly shown that image generators like Stable Diffusion carry strong internal representations of depth and geometry, such that performant depth-estimation models can be built from their features with minimal retraining. This continues to be true for video generation models.
So do you think this is part of how it generates images? That is, having used depth estimation and much else to infer 3D objects, scenes, and the wider workings of the 3D world from its training photographs, it turns a new description into an internal representation of a 3D scene and then renders that scene to photorealistic 2D, in effect reversing its 2D-to-3D inference?
That seems miraculous to me: not so much that it is possible, but that a neural network figured it all out by itself, rather than the complex algorithms required being elaborately hand-coded.
I would have assumed that at best a neural network would infer a big pile of kludges that would produce poor results, like a human trying to Photoshop a bunch of different photographs together.
Yes, that’s what I’m implying. We don’t have super concrete evidence for it, but a large contingent of researchers (including myself) believes things like that are happening.
To understand why neural networks might do this, you can view training as a rough analogue of Solomonoff induction. The neural network is a “program” (similar to a Python program written in code) that has to model its data as well as possible, but the model has a fixed size, which means the program is limited in length. If you think about how you might write a program to generate images, it would be much more code-efficient to write abstractions and rules for things like geometry than to enumerate every possibility. The training/optimization process discovers the same thing.
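A crude way to feel the “shorter program” intuition (my own toy example, not from any of the cited work): compare the storage cost of an image generated by a simple geometric rule against a same-sized image with no structure. Off-the-shelf compression acts as a rough proxy for description length here; the specific images are arbitrary choices.

```python
import zlib
import numpy as np

# A 256x256 image produced by a simple geometric rule: brightness equal
# to distance from the center (a "rendered" radial gradient).
n = 256
y, x = np.mgrid[0:n, 0:n]
rule_img = np.sqrt((x - n / 2) ** 2 + (y - n / 2) ** 2).astype(np.uint8)

# A same-sized image with no generating rule: independent random pixels.
noise_img = np.random.default_rng(0).integers(0, 256, (n, n), dtype=np.uint8)

# Compressed size as a rough proxy for minimum description length.
rule_bytes = len(zlib.compress(rule_img.tobytes()))
noise_bytes = len(zlib.compress(noise_img.tobytes()))
print(f"rule-based image: {rule_bytes} bytes, structureless image: {noise_bytes} bytes")
```

The rule-based image compresses far better because a short description (the rule) explains all of its pixels, while the structureless image can only be stored by enumeration. A capacity-limited model faces the same trade-off and is pushed toward the rules.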
You might also be interested in emergent world representations [1] (models simulate the underlying processes or “world” that generates their data in order to predict it) and the platonic representation hypothesis [2] (different models trained on separate modalities like text and images form similar representations, meaning that an image model will represent a picture of a dog in a similar way to how a text model will represent the word “dog”).
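One common way claims like the platonic representation hypothesis are tested is by measuring similarity between two models’ representations of the same inputs, for example with linear CKA (centered kernel alignment). Below is a minimal self-contained sketch: the “two models” are just different random projections of a shared underlying signal (an assumption standing in for real text and image encoders), compared against features with no shared content.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices whose rows correspond to
    the same inputs; columns are each model's feature dimensions."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)

# Shared underlying "content" (e.g. whatever is common between a dog photo
# and the word "dog"); each "model" sees it through its own projection.
content = rng.normal(size=(500, 8))
model_a = content @ rng.normal(size=(8, 64))  # stand-in "image model"
model_b = content @ rng.normal(size=(8, 48))  # stand-in "text model"
unrelated = rng.normal(size=(500, 48))        # features sharing no content

print(f"shared-content CKA: {linear_cka(model_a, model_b):.2f}")
print(f"unrelated CKA:      {linear_cka(model_a, unrelated):.2f}")
```

Two models that encode the same underlying content score high even though their feature spaces differ, while unrelated features score near zero; the hypothesis is that real models trained on different modalities land in the first regime.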
Early work on depth representations in diffusion models: https://arxiv.org/pdf/2409.09144
1: https://arxiv.org/abs/2210.13382
2: https://arxiv.org/abs/2405.07987