Deep learning systems require huge amounts of data to approach human-level generalization. This suggests that, to an extent, what’s learned from any single example is “shallow”. Perhaps this could be seen as closer to plagiarism.
The lawsuit against Stable Diffusion argues that SD works by amassing a huge library of images that the system then interpolates between in order to generate the desired kinds of images, but struggles to create the kinds of image combinations that don’t appear in the training data and thus can’t be interpolated between. Some of my friends have also remarked on this, e.g. that there are many contexts where it’s a struggle to get the system to draw women in a non-sexualized way. (See also Scott Alexander on the way that DALL-E conflates style and content.)
This is then different from the kind of learning that a human artist does: humans don’t just store a huge library of reference photos in their mind and interpolate between them; they also acquire a conceptual understanding of the world. Because of that, they can easily draw pictures even of things they’ve never seen before (“a dog wearing a baseball cap while eating ice cream” is the example used in the complaint). In contrast, systems like Stable Diffusion can only draw things that are a sufficiently close match to images they’ve already seen. In that sense, a human artist who draws the kind of picture that would otherwise not have existed in SD’s training set is much more directly enabling the system to draw those kinds of pictures than they would be enabling another human artist to do the same. (Or so the argument goes.)
From the complaint:
Ho showed how a latent image could be interpolated—meaning, blended mathematically—to produce new derivative images. Rather than combine two images pixel by pixel—which gives unappealing results—Ho showed how Training Images can be stored in the diffusion model as latent images and then interpolated as a new latent image. This interpolated latent image can then be converted back into a standard pixel-based image.
The diagram below, taken from Ho’s paper, shows how this process works, and demonstrates the difference in results between interpolating pixels and interpolating latent images.
In the diagram, two photos are being blended: the photo on the left labeled “Source x0,” and the photo on the right labeled “Source x′0.”
The image in the red frame has been interpolated pixel by pixel, and is thus labeled “pixel-space interpolation.” This pixel-space interpolation simply looks like two translucent face images stacked on top of each other, not a single convincing face.
The image in the green frame, labeled “denoised interpolation”, has been generated differently. In that case, the two source images have been converted into latent images (illustrated by the crooked black arrows pointing upward toward the label “Diffused source”). Once these latent images have been interpolated (represented by the green dotted line), the newly interpolated latent image (represented by the smaller green dot) has been reconstructed into pixels (a process represented by the crooked green arrow pointing downward to a larger green dot). This process yields the image in the green frame. Compared to the pixel-space interpolation, the difference is apparent: the denoised blended interpolation looks like a single convincing human face, not an overlay or combination of images of two faces. [...]
Despite the difference in results, these two modes of interpolation are equivalent: they both generate derivative works from the source images. In the pixel-space interpolation (the red-framed image), the source images themselves are being directly interpolated to make a derivative image. In the denoised interpolation (the green-framed image), (1) the source images are being converted to latent images, which are lossy-compressed copies; (2) those latent images are being interpolated to make a derivative latent image; and then (3) this derivative latent image is decompressed back into a pixel-based image.
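The two modes of interpolation described above can be sketched in a few lines of code. This is a toy illustration only: the arrays are random, and the linear `encode`/`decode` pair (the weight matrix `W` and its pseudo-inverse) is a hypothetical stand-in for the learned neural encoder and decoder of a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 8x8 grayscale "source images".
img_a = rng.random((8, 8))
img_b = rng.random((8, 8))

def pixel_interpolate(a, b, t=0.5):
    # Naive pixel-space blend: a weighted average of raw pixels,
    # which visually looks like two translucent images stacked up.
    return (1 - t) * a + t * b

# Hypothetical stand-ins for the model's encoder/decoder. In a real
# diffusion model these are learned networks, not linear maps.
W = rng.random((16, 64))  # toy "encoder" weights

def encode(img):
    return W @ img.reshape(-1)  # image -> latent vector

def decode(z):
    return (np.linalg.pinv(W) @ z).reshape(8, 8)  # latent -> image

def latent_interpolate(a, b, t=0.5):
    # Blend in latent space, then decode. With a learned decoder this
    # yields a single coherent image rather than an overlay.
    z = (1 - t) * encode(a) + t * encode(b)
    return decode(z)

mixed_pixels = pixel_interpolate(img_a, img_b)
mixed_latent = latent_interpolate(img_a, img_b)
```

With this linear toy the two results differ only by whatever information the encoder discards; the point of a *learned* encoder and decoder is precisely that blending in latent space produces a single plausible image rather than a double exposure.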
In April 2022, the diffusion technique was further improved by a team of researchers led by Robin Rombach at Ludwig Maximilian University of Munich. These ideas were introduced in his paper “High-Resolution Image Synthesis with Latent Diffusion Models.”
Rombach is also employed by Stability as one of the primary developers of Stable Diffusion, which is a software implementation of the ideas in his paper.
Rombach’s diffusion technique offered one key improvement over previous efforts. Rombach devised a way to supplement the denoising process with extra information, so that latent images could be interpolated in more complex ways. This process is called conditioning. The most common tool for conditioning is short text descriptions, previously introduced as Text Prompts, that describe elements of the desired image, e.g., “a dog wearing a baseball cap while eating ice cream”. This method uses Text Prompts as conditioning data to select latent images that are already associated with text captions indicating they contain “dog,” “baseball cap,” and “ice cream.” The text captions are part of the Training Images, and were scraped from the websites where the images themselves were found.
The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool.
The result of this conditioning process may or may not be a satisfying or accurate depiction of the Text Prompt. Below is an example of output images from Stable Diffusion (via the DreamStudio app) using this Text Prompt—“a dog wearing a baseball cap while eating ice cream”. All the dogs in the resulting images seem to be wearing baseball caps. Only the one in the lower left seems to be eating ice cream. The two on the right seem to be eating meat, not ice cream.
In general, none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data. This stands to reason: the use of conditioning data to interpolate multiple latent images means that the resulting hybrid image will not look exactly like any of the Training Images that have been copied into those latent images.
But it is also true that the only thing a latent-diffusion system can do is interpolate latent images into hybrid images. There is no other source of visual information entering the system.
Every output image from the system is derived exclusively from the latent images, which are copies of copyrighted images. For these reasons, every hybrid image is necessarily a derivative work.
A latent-diffusion system can never achieve a broader human-like understanding of terms like “dog,” “baseball hat,” or “ice cream.” Hence, the use of the term “artificial intelligence” in this context is inaccurate.
A latent-diffusion system can only copy from latent images that are tagged with those terms. The system struggles with a Text Prompt like “a dog wearing a baseball cap while eating ice cream” because, though there are many photos of dogs, baseball caps, and ice cream among the Training Images (and the latent images derived from them) there are unlikely to be any Training Images that combine all three.
A human artist could illustrate this combination of items with ease. But a latent-diffusion system cannot because it can never exceed the limitations of its Training Images.
In practice, the quality of the latent-diffusion images depends entirely on the breadth and quality of the Training Images used to generate the latent images. If that weren’t true, then it wouldn’t matter where Stable Diffusion (or any other AI-Image Product) got its Training Images.
In actuality, the provenance of an AI-Image-Product’s Training Images matters very much. According to Emad Mostaque, CEO of Stability, Stable Diffusion has “compress[ed] the knowledge of over 100 terabytes of images.” Though the rapid success of Stable Diffusion has been partly reliant on a great leap forward in computer science, it has been even more reliant on a great leap forward in appropriating copyrighted images.
What’s amusing is that before this case ever sees a trial, the above limitations may be overcome: for instance, by feedback from a system that checks whether the output image actually satisfies the prompt, and whether humans have the correct number of fingers.
Interestingly, I believe this is a limitation that one of the newest (as yet unreleased) diffusion models, called DeepFloyd, has overcome; a number of examples have been teased already, such as the following corgi sitting in a sushi doghouse:
https://twitter.com/EMostaque/status/1615884867304054785?t=jmvO8rvQOD1YJ56JxiWQKQ&s=19
As such, the quoted paragraphs surprised me as an instance of a straightforwardly falsifiable claim in the legal documents.