So do you think this is part of how it generates images—i.e. having used depth estimation and much else to infer 3D objects/scenes and the wider workings of the 3D world from its training photographs, it turns a new description into an internal representation of a 3D scene and then renders it to photorealistic 2D using, inter alia, a kind of reversal of its 2D->3D inference algorithm???
Which seems miraculous to me—not so much that this is possible, but that a neural network figured this all out by itself rather than the complex algorithms required being very elaborately hand-coded.
I would have assumed that at best a neural network would infer a big pile of kludges that would produce poor results, like a human trying to Photoshop a bunch of different photographs together.
Yes, that’s what I’m implying. We don’t have super concrete evidence for it, but a large contingent of researchers (including myself) believes things like that are happening.
To understand why neural networks might do this, you can view them as performing a sort of Solomonoff induction. The neural network is a “program” (similar to a Python program written in code) that has to model its data as well as possible, but the model is only so big, which means the program is limited in length. If you think about how you might write a program to generate images, it would be much more code-efficient to write abstractions and rules for things like geometry than to enumerate every possibility. The training/optimization algorithm figures that out too.
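To make the length argument concrete, here is a toy sketch of my own (pure NumPy, and nothing to do with how a real image model is actually implemented): “memorizing” renderings of a cube needs a new pixel array for every viewpoint, while a few lines of geometry code cover all of them.

```python
import numpy as np

# Option A: memorize every training image. The "program" is a lookup table,
# and its length grows with every new viewpoint it has to cover.
memorized_views = {}  # {angle: 64x64 pixel array}, one stored rendering per angle

# Option B: a small geometric rule. A few lines cover every viewpoint.
CUBE_VERTICES = np.array(
    [[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float
)

def render_cube_vertices(angle, size=64, focal=1.5, distance=4.0):
    """Rotate the cube about the y-axis and project its 8 corners with a pinhole camera."""
    c, s = np.cos(angle), np.sin(angle)
    rotation_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    points = CUBE_VERTICES @ rotation_y.T
    points[:, 2] += distance                       # push the cube in front of the camera
    uv = focal * points[:, :2] / points[:, 2:3]    # perspective divide: 3D -> 2D
    pixels = np.clip(((uv + 1.0) / 2.0 * (size - 1)).astype(int), 0, size - 1)
    image = np.zeros((size, size))
    image[pixels[:, 1], pixels[:, 0]] = 1.0        # mark the projected corners
    return image

# The geometric "program" is a fixed, short description that generalizes to any
# angle; the lookup table has to grow for every new angle it supports.
frames = [render_cube_vertices(a) for a in np.linspace(0.0, np.pi, 8)]
```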
You might also be interested in emergent world representations [1] (models simulate the underlying process, or “world”, that generates their data in order to predict it) and the platonic representation hypothesis [2] (different models trained on separate modalities like text and images form similar representations, meaning that an image model will represent a picture of a dog in much the same way a text model represents the word “dog”).
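For a rough sense of what “similar representations” means operationally, here is a minimal sketch (my own toy measure, not the mutual-nearest-neighbour alignment metric used in the platonic representation paper): since a text model and an image model live in different vector spaces, you compare the geometry of their embeddings, i.e. how similarly they arrange the same set of concepts. The embeddings below are random placeholders standing in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
concepts = ["dog", "cat", "car", "tree", "house"]
text_embs = rng.normal(size=(len(concepts), 512))    # stand-in for a text model's embeddings
image_embs = rng.normal(size=(len(concepts), 768))   # stand-in for an image model's embeddings

def pairwise_cosine(x):
    """Cosine similarity between every pair of rows."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def alignment(a, b):
    """Correlate the two models' pairwise-similarity structure (off-diagonal entries only)."""
    iu = np.triu_indices(len(a), k=1)
    return np.corrcoef(pairwise_cosine(a)[iu], pairwise_cosine(b)[iu])[0, 1]

# For real models the hypothesis predicts this score rises with scale and capability;
# for the random placeholders here it will hover near zero.
print(alignment(text_embs, image_embs))
```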
1: https://arxiv.org/abs/2210.13382
2: https://arxiv.org/abs/2405.07987