Swimmer963 highlights DALL-E 2 struggling with anime, realistic faces, text in images, multiple characters/objects arranged in complex ways, and editing. (Of course, many of these are still extremely good by the standards of just months ago, and the glass is definitely more than half full.) itsnotatumor asks:
How many of these “cannot do’s” will be solved by throwing more compute and training data at the problem? Anyone know if we’ve started hitting diminishing returns with this stuff yet?
In general, we have not topped out on pretty much any scaling curve. Whether it’s language modeling, image generation, DRL, or what have you, AFAIK not a single modality can truly be said to have been ‘solved’, with its scaling curve broken: either the curve is still flat (ie. unbent, showing no diminishing returns), or we’re still far from the point where it bends. (There are some sound-related ones which seem to be close, but nothing all that important.) The only scaling law for diffusion models I know of is an older one which bends a little, but that probably reflects poor hyperparameters, and no one has tried eg. a Chinchilla-style compute-optimal analysis on them yet.
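For concreteness, “trying eg. Chinchilla on them” would mean fitting the same sort of parametric loss curve the Chinchilla paper fit for language models and then rebalancing model size against data under a fixed compute budget. A sketch of that functional form, purely for orientation (the rough exponents are the language-model fits; nobody has published the diffusion equivalents):

$$L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND$$

$$N_{\mathrm{opt}}(C) \propto C^{a}, \quad D_{\mathrm{opt}}(C) \propto C^{b}, \quad a \approx b \approx 0.5 \text{ for LMs}$$

where $N$ is parameter count, $D$ is training tokens (or samples), and $C$ is total training compute in FLOPs.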
So yes, we definitely can just make all the compute-budgets 10x larger without the extra compute being wasted.
To go through the specific issues (caveat: we don’t know if Make-A-Scene solves any of these because no one can use it; and I have not read the CogView2 paper*):
anime & realistic faces are purely problems OA has imposed on itself. DALL-E 2 will do them fine just as soon as OA wants it to, and other orgs’ models already do just fine in those domains. So no real problem there.
text in images: this is an odd one. It is especially odd because it destroys the commercial application of any image with text in it (the text comes out as garbage; who’d pay for these?), and if you go back to DALL-E 1, one of the demos was putting text into images, like onto generated teapots or storefronts. It was imperfect, but DALL-E 2 looks to be way worse at it; DALL-E 1 would’ve at least spelled ‘Avengers’ correctly. Nostalgebraist has also shown you can get excellent text generation with a specialized small model, and people using CompVis models (also much smaller than DALL-E 2) get good text results. So text in images is not intrinsically hard; this is a DALL-E 2-specific problem, whatever it is.
Why? As Nostalgebraist discusses at length in his earlier post, the unCLIP approach to building the DALL-E 2 system on top of GLIDE seems to have a lot of weird drawbacks and tradeoffs. Just as CLIP’s contrastive view of the world (rather than a discriminative or generative one) leads to strange artifacts like images tessellating a pattern, unCLIP seems to cripple DALL-E 2 in some ways, such as worsened compositionality. I don’t really understand the unCLIP approach, so I’m not completely sure why it’d screw up text. The paper speculates that:
it is possible that the CLIP embedding does not precisely encode spelling information of rendered text. This issue is likely made worse because the BPE encoding we use obscures the spelling of the words in a caption from the model, so the model needs to have independently seen each token written out in the training images in order to learn to render it.
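As a quick illustration of the BPE point in that quote, here is a minimal check of how a subword tokenizer chops words up; GPT-2’s BPE is used as a stand-in (an assumption: CLIP’s own BPE vocabulary differs, but the effect is the same in kind):

```python
# The model conditioning on these captions sees a few opaque subword IDs per word,
# never individual letters, so spellings must be memorized token-by-token from
# whatever rendered text happened to appear in training images.
from transformers import GPT2Tokenizer  # GPT-2 BPE as a stand-in for CLIP's BPE

tok = GPT2Tokenizer.from_pretrained("gpt2")
for word in ["Avengers", "storefront", "teapot"]:
    print(word, "->", tok.tokenize(word))  # a handful of subword chunks, no character-level view
```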
Damn you BPEs! Is there nothing you won’t blight?!

It may also be partially a dataset issue: OA’s licensing of commercial datasets may have greatly underemphasized images which have text in them, since text-heavy images tend to be more of a dirty-Internet or downstream-user thing. If it’s unCLIP, raw GLIDE should be able to do text much better. If it’s the training data, raw GLIDE probably won’t be much different.
If it’s the training data, it’ll be easy to fix if OA wants to fix it (like anime/faces): OA can find text-heavy datasets, or simply synthesize the necessary data by splatting random text in random fonts on top of random images and training on that. If it’s unCLIP, it can be hacked around by letting users bypass unCLIP and use raw GLIDE, which as far as I know they have no ability to do at the moment. (Seems like a very reasonable option to offer, if only for other purposes like research.) A longer-term solution would be to figure out a better unCLIP which avoids these contrastive pathologies, and a still longer-term solution would be to simply scale up enough that you no longer need this weird unCLIP thing to get diverse but high-quality samples, because the base models are just good enough.
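A minimal sketch of that “splat random text onto random images” synthesis, assuming PIL, a local font directory, and a toy word list (the paths, vocabulary, and caption template are my own illustrative assumptions, not anything OA does):

```python
# Hypothetical data-synthesis sketch: composite a random word onto an image and
# emit a caption mentioning that word, forcing the model to associate spellings
# with rendered glyphs.
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

WORDS = ["Avengers", "OPEN", "coffee", "sale", "stop"]      # toy vocabulary
FONTS = list(Path("/usr/share/fonts").rglob("*.ttf"))       # whatever fonts exist locally

def splat_text(img_path: Path, out_dir: Path) -> tuple[Path, str]:
    """Overlay a random word in a random font/size/position; return (image path, caption)."""
    img = Image.open(img_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    word = random.choice(WORDS)
    font = ImageFont.truetype(str(random.choice(FONTS)), size=random.randint(24, 96))
    x = random.randint(0, max(1, img.width // 2))
    y = random.randint(0, max(1, img.height // 2))
    draw.text((x, y), word, font=font, fill=random.choice(["white", "black", "red"]))
    out_path = out_dir / f"{img_path.stem}_{word}.png"
    img.save(out_path)
    caption = f'a photo with the word "{word}" written on it'
    return out_path, caption
```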
So this might be relatively easy to fix, or it might have an obvious fix that simply won’t be implemented for a long time.
complex scenes: this one is easy—unCLIP is screwing things up.
The problem with these samples generally doesn’t seem to be that the individual objects are rendered badly by GLIDE or the upscalers; the problem seems to be that the objects are organized wrong, because the DALL-E 2 system as a whole didn’t understand the text input. That is, CLIP gave GLIDE the wrong blueprint, and that is irreversible. And we know that GLIDE can do these things better, because the paper shows how much better it does on one pair of prompts (no extensive or quantitative evaluation, however):
In Figure 14, we find that unCLIP struggles more than GLIDE with a prompt where it must bind two separate objects (cubes) to two separate attributes (colors). We hypothesize that this occurs because the CLIP embedding itself does not explicitly bind attributes to objects, and find that reconstructions from the decoder often mix up attributes and objects, as shown in Figure 15.
And it’s pretty obvious that it almost has to screw up like this if you take the approach of a contrastively-learned fixed-size embedding (Nostalgebraist again): a fixed-size embedding is going to struggle when you want to stack on arbitrarily many details, especially without any recurrence or iteration (which DALL-E 1 had, being a Transformer over text inputs + VAE-token outputs). And a contrastive model like CLIP isn’t going to do relationships or scenes as well as it does other things, because it just doesn’t encounter many pairs of images where the objects are all the same but their relationship differs as specified by the text caption, which is the sort of data that would force it to learn how “the red box is on top of the blue box” looks different from “the blue box is on top of the red box”.
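This claim is easy to probe with the released CLIP weights: embed the two swapped-relation captions and compare them. A minimal sketch, assuming OpenAI’s `clip` package (`pip install git+https://github.com/openai/CLIP.git`) and the small ViT-B/32 checkpoint; if the argument above is right, the cosine similarity should come out suspiciously close to 1, though that is an empirical question, not a result I’m reporting:

```python
# Probe whether CLIP's text embedding distinguishes swapped spatial relations.
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
captions = ["the red box is on top of the blue box",
            "the blue box is on top of the red box"]
tokens = clip.tokenize(captions)
with torch.no_grad():
    emb = model.encode_text(tokens)
emb = emb / emb.norm(dim=-1, keepdim=True)        # unit-normalize each embedding
print("cosine similarity:", (emb[0] @ emb[1]).item())
```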
Like before, just offering GLIDE as an option would fix a lot of the problems here. unCLIP screws up your complex scene? Do it in GLIDE. GLIDE is harder to guide or lower-quality? Maybe seed the image in GLIDE and then jazz it up in the full DALL-E 2.
Longer-term, a better text encoder would go a long way towards resolving all sorts of problems. Existing text models would be enough; no need for hypothetical new archs. People are accusing DALL-E 2 of lacking good causal understanding or of being unable to solve problems of various sorts; fine, but CLIP is a very bad way to understand language, being a very small text encoder (base model: 0.06b) trained contrastively from scratch on short image captions rather than initialized from a real autoregressive language model. (Remember, OA started the CLIP research with autoregressive generation, per Figure 2 in the CLIP paper; it just found that approach more expensive, not worse, and switched to the contrastive objective.) A real language model, like Chinchilla-70b, would do much better when fused to an image model, as in Flamingo.
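A minimal sketch of that “fuse a real, frozen language model to the image model” idea (this is not Flamingo or Imagen code; the T5-small checkpoint, the single cross-attention layer, and the shapes are all illustrative assumptions):

```python
# Frozen pretrained text encoder -> cross-attention into an image backbone's features.
import torch
from torch import nn
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small").eval()   # stays frozen
for p in text_encoder.parameters():
    p.requires_grad_(False)

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)  # trainable fusion layer

def condition_on_prompt(image_tokens: torch.Tensor, prompt: str) -> torch.Tensor:
    """image_tokens: (1, seq, 512) features from some hypothetical image backbone."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        text_emb = text_encoder(**ids).last_hidden_state             # (1, n_tokens, 512)
    attended, _ = cross_attn(query=image_tokens, key=text_emb, value=text_emb)
    return image_tokens + attended                                    # residual fusion

# Toy usage:
fused = condition_on_prompt(torch.randn(1, 64, 512), "a red box on top of a blue box")
```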
So, these DALL-E 2 problems all look soluble to me by pursuing just known techniques. They stem from deliberate choices (the anime/face restrictions), from removing the user’s ability to choose a different tradeoff (no raw-GLIDE access), or from a lack of simple-minded scaling.
* On skimming, CogView2 looks like it’d avoid most of the DALL-E 2 pathologies, but it appears noticeably lower-quality in addition to lower-resolution.
EDIT: between Imagen, Parti, DALL-E 3, and the miracle-of-spelling paper, I think that my claims that text in images is simply a matter of scale, and that tokenization screws up text in images, are now fairly consensus in DL as of late 2023.
Google Brain just announced Imagen (Twitter), which on skimming appears to be not just as good as DALL-E 2 but convincingly better. The main change appears to be reducing the reliance on CLIP in favor of a much larger and more powerful text encoder (a frozen T5) feeding the image-diffusion stage. They make a point of noting superiority on “compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts.” The samples also show text rendering fine inside the images.
I take this as strong support (already) for my claims 2-3: the problems with DALL-E 2 were not major or deep ones, do not require any paradigm shift to fix, or even any fix, really, beyond just scaling the components almost as-is. (In Kuhnian terms, the differences between DALL-E 2 and Imagen or Make-A-Scene are so far down in the weeds of normal science/engineering that even people working on image generation will forget many of the details and have to double-check the papers.)
EDIT: Google also has a more traditional autoregressive DALL-E-1-style 1024px model, “Parti”, competing with the diffusion-based Imagen; it is slightly better than Imagen on COCO FID. It likewise does well on all those issues, with again no special fancy engineering aimed specifically at them, mostly just scaling up to 20b.