Regarding the “a green stop sign in a field of red flowers” images, this does not at all make me update downwards on the capabilities of a maximally capable AI using modern techniques and equipment.
This is because this failure mode is a symptom of how CLIP (one of the components of DALL-E 2) is trained, and would likely be fixed by a small tweak to the training process of CLIP.
CLIP is trained by showing the model an image and a piece of text, and asking if the image and the text are a good fit. CLIP outputs a list of many (hundreds of) numbers, describing different attributes, for both the text and the image. If the text actually belongs to the image, then the model is “rewarded” for making the same attributes big (or small) for both the text and the image; but if the text belongs to a different image, it is “punished” for making the numbers similar.
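For concreteness, here is a rough sketch of that kind of symmetric contrastive loss in PyTorch (my sketch of the general technique, not the exact OpenAI implementation); the embeddings are assumed to already come out of CLIP’s image and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize each embedding to unit length so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity of every image in the batch with every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature  # shape: (batch, batch)

    # The "correct" caption for image i is caption i; every other caption
    # in the batch is treated as a negative example.
    targets = torch.arange(logits.shape[0], device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # images -> captions
    loss_t2i = F.cross_entropy(logits.t(), targets)  # captions -> images
    return (loss_i2t + loss_t2i) / 2
```

The key point for the argument below is the `targets` line: the negatives are simply the other captions that happen to be in the same batch.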
Importantly, there are many different types of images in the training set, with all kinds of different captions belonging to them. This means that almost always, if CLIP is being shown an image that does not fit the text, the image and the text are wildly different from each other. It’s more “a drawing of a great white shark” vs “a green stop sign against a field of red flowers”, rather than “green sign, red flowers” vs “red sign, green flowers”. (Note: there was no effort to “trick” CLIP by grouping similar images together during training, to help it learn to distinguish them.)
So CLIP knows the difference between “a field of red flowers” and “a cartoon of a dancing coyote”, but barely knows the difference between “a red stop sign against a green background” and “a green stop sign against a red background”. It vaguely knows the difference, but only barely, because it basically never had to tell such similar inputs apart.
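One way to probe this concretely would be to score a single generated stop-sign image against the two near-identical captions and compare. A sketch using the Hugging Face transformers CLIP wrapper (the checkpoint name is just an example public checkpoint, and the image path is a placeholder):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("stop_sign.png")  # placeholder: one of the generated images
captions = [
    "a red stop sign against a green background",
    "a green stop sign against a red background",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One similarity score per caption. If CLIP "barely knows the difference",
# the two scores will come out nearly identical.
print(outputs.logits_per_image)
```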
So, 99.9% of the time (I’m estimating that), when CLIP saw a training example where the text and image were at least as similar as “a green stop sign in a field of red flowers” and the worst match among the images seen above, they actually were a good match for each other. So when DALL-E 2 draws a red stop sign and asks CLIP if it got it right, CLIP just shrugs its shoulders and says, “yeah, I guess”, so DALL-E feels happy returning that image (note: DALL-E doesn’t actually work like that).
But it’d be incredibly easy to train CLIP in such a way that it has to tell the difference between similar images much more often, and if that happened, I would expect this failure mode to immediately go away. So this example tells us nothing when trying to forecast the capabilities of future models.
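One simple version of such a tweak (again my sketch, not something from the DALL-E 2 paper) is to build training batches around “hard negatives”: deliberately put images whose captions are very similar into the same batch, so the contrastive loss actually has to separate them:

```python
import random

def build_hard_negative_batch(dataset, batch_size, similar_caption_ids):
    """Assemble a batch whose captions are deliberately similar to one another.

    `similar_caption_ids` is a hypothetical precomputed index (e.g. nearest
    neighbours in caption-embedding space) mapping each example to the
    examples with the most similar captions.
    """
    anchor = random.randrange(len(dataset))
    neighbours = similar_caption_ids[anchor][: batch_size - 1]
    indices = [anchor] + list(neighbours)
    return [dataset[i] for i in indices]
```

With batches like this, “red sign, green flowers” vs “green sign, red flowers” stops being a once-in-a-thousand comparison and becomes something the loss penalizes constantly, which is why I’d expect the failure mode to disappear.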
I don’t think that’s true.
(I’m choosing not to respond to this comment for reasons relating to potential info-hazards)