Making DALL-E Count

Can DALL-E count? Does DALL-E’s counting ability depend on what it is counting? Let’s find out!

I generate images only once per prompt and paste them in alongside the prompt, which is given in quotation marks.

Numbers

“Zero”

“One”

“Two”

“Three”

We have three leaves, but I’m guessing DALL-E is noticing that “three” rhymes with “tree.”

“Four”

I guess DALL-E just thinks three is a better number than four.

“Five”

“Six”

If an ace is treated as 1, and we add the five dots on the dice, then maybe that counts?

“Seven”

“Eight”

DALL-E is very into 8.

“Nine”

“Ten”

“One hundred”

Digits

“0”

“1”

“2”

“3”

“4”

“5”

“6”

“7”

“8”

“9”

“10”

“100”

Cats

“Zero cats”

“One cat”

“Two cats”

“Three cats”

“Four cats”

“Five cats”

“Six cats”

“Seven cats”

“Eight cats”

“Nine cats”

“Ten cats”

“One hundred cats”

Reroll-eight experiment

I notice that DALL-E seems to be pretty reliable at generating 1-3 of something, and that it seems to prefer spelled-out numbers to digits. It was very successful at generating the number eight, which I suspected was due to the influence of images of magic 8-balls. It only managed to get eight cats once, maybe by accident. With my last 7 free credits, I decided to see if DALL-E had at least some sense of what it means to count to eight, as opposed to representing parts of images associated with the word “eight.”

I therefore decided to generate 7 panels of images of various objects, count the number of those objects contained in each image, plot the result, and see if it was at least centered around the number 8. The nouns were chosen with a random noun generator, and I selected the first 7 that seemed “easy to draw.” The resulting prompts were “eight paintings,” “eight hospitals,” “eight poets,” “eight historians,” “eight speakers,” “eight women,” and “eight baskets.”

The caption under each prompt gives the number of objects I counted in each of its four images.

“Eight paintings”

15, 9, 9, 9

“Eight hospitals”

8, 9, 8, 5

“Eight poets”

4, 20, 5, 8

“Eight historians”

8, 9, 13, 10

“Eight speakers”

6, 5, 3, 3

“Eight women”

12, 8, 20, 9

“Eight baskets”

5, 8, 9, 7

Results for reroll-8 experiment

Discussion of reroll-8 experiment

The reroll-eight experiment did generate exactly 8 objects about 20% of the time (6 of 28 images).

However, I think it’s interesting that an image contained exactly 7 objects only once, while images with 9 objects (7 of 28) were even more common than images with 8.
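The tally behind these observations can be reproduced from the captions above. This is a minimal Python sketch, assuming the caption numbers are the complete data set; the names `REROLL_8_COUNTS` and `summarize` are made up for illustration and are not part of any DALL-E tooling.

```python
from collections import Counter
from statistics import mean, median

# Object counts per image, copied from the captions above (four images per prompt).
REROLL_8_COUNTS = {
    "paintings": [15, 9, 9, 9],
    "hospitals": [8, 9, 8, 5],
    "poets": [4, 20, 5, 8],
    "historians": [8, 9, 13, 10],
    "speakers": [6, 5, 3, 3],
    "women": [12, 8, 20, 9],
    "baskets": [5, 8, 9, 7],
}


def summarize(counts_by_prompt):
    """Flatten the per-prompt counts; return a frequency tally and summary statistics."""
    flat = [n for counts in counts_by_prompt.values() for n in counts]
    tally = Counter(flat)
    return {
        "images": len(flat),
        "tally": tally,
        "share_of_eights": tally[8] / len(flat),
        "mean": mean(flat),
        "median": median(flat),
    }


summary = summarize(REROLL_8_COUNTS)

# Text histogram: one '#' per image that contained that many objects.
for value, freq in sorted(summary["tally"].items()):
    print(f"{value:>2} | {'#' * freq}")
print(f"share of eights: {summary['share_of_eights']:.0%}")
print(f"mean: {summary['mean']:.2f}, median: {summary['median']}")
```

On these numbers the median is 8 and the mean is a little under 9, so the counts are roughly centered on eight even though the single most frequent count is 9.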

This suggests to me that DALL-E has a heavy bias toward the number 9, perhaps because a 3x3 grid is a common pattern. DALL-E seems to flip over into “grid mode” or “arrangement mode” once it has decided that it needs to display more than about 5 of an object and needs some structure for their visual composition.

Sometimes, DALL-E gets lucky and happens to choose a structure that allows it to put in 8 objects: a lineup, a semi-structured display, two rows of four, four pairs of two, a cloud, objects in a ring.

Psychologically, this makes me think about cultures in which counting is limited and imprecise (“one, two, three… many”). DALL-E hasn’t been trained to rigidly distinguish between numbers in general. It has only been trained to distinguish between numbers that are “compositionally relevant,” like the difference between 1 and 2.

Compositional reroll-8 experiment

I wanted to test this theory by seeing if DALL-E could reliably generate 8 of something if the count was specified compositionally rather than numerically. So I bought more credits.

My best idea was “An octopus holding eight [noun], one in each leg.”

I didn’t want to generate a second point of confusion for DALL-E by asking it to generate things that can’t conceivably be held in one’s hands, like hospitals. So I decided to replace the nouns with things that fit in the hands: paintings, baskets, cigarettes, newspapers, salads, coffees, and things.

“An octopus holding eight paintings, one in each leg.”

8, 6, 8, 8

“An octopus holding eight baskets, one in each leg.”

5, 2, 5, 6

“An octopus holding eight cigarettes, one in each leg.”

5, 4, 8, 5

“An octopus holding eight newspapers, one in each leg.”

2, 4, 3, 4

“An octopus holding eight salads, one in each leg.”

10, 4, 3, 6

“An octopus holding eight coffees, one in each leg.”

7, 5, 6, 4

“An octopus holding eight things, one in each leg.”

9, 6, 6, 8

Discussion of compositional reroll-8 experiment

Here, we don’t see any “pull toward 9.” I’m guessing that octopus arms don’t correspond to the “grid representation.” We see a second peak at 8 (5 of 28 images), alongside roughly equal numbers of images with 4, 5, and 6 items (5, 5, and 6 images). This could mean that DALL-E is “torn” between count and composition, or perhaps that there is a bimodal distribution of octopodal compositions in the training data: some with 4-6 items in hand, others with 8.
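A side-by-side tally makes the comparison with the first experiment concrete. This is a rough Python sketch that uses only the caption numbers from the two experiments; the names `PLAIN`, `OCTOPUS`, and `tally` are illustrative assumptions, not anything from the original setup.

```python
from collections import Counter

# Counts from the first experiment ("eight [noun]"), in caption order.
PLAIN = [
    15, 9, 9, 9,    # paintings
    8, 9, 8, 5,     # hospitals
    4, 20, 5, 8,    # poets
    8, 9, 13, 10,   # historians
    6, 5, 3, 3,     # speakers
    12, 8, 20, 9,   # women
    5, 8, 9, 7,     # baskets
]

# Counts from the octopus prompts, in caption order.
OCTOPUS = [
    8, 6, 8, 8,     # paintings
    5, 2, 5, 6,     # baskets
    5, 4, 8, 5,     # cigarettes
    2, 4, 3, 4,     # newspapers
    10, 4, 3, 6,    # salads
    7, 5, 6, 4,     # coffees
    9, 6, 6, 8,     # things
]


def tally(counts):
    """Return how many images contained each object count."""
    return Counter(counts)


plain_tally = tally(PLAIN)
octopus_tally = tally(OCTOPUS)

# Print one row per observed object count, with the frequency in each experiment.
print("count  plain  octopus")
for value in sorted(set(PLAIN) | set(OCTOPUS)):
    print(f"{value:>5}  {plain_tally[value]:>5}  {octopus_tally[value]:>7}")
```

With only four images per prompt, a table like this is suggestive at best, but it does show the 9s mostly disappearing under the octopus prompts.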

I tried doing a “reroll-12” experiment, replacing the numbers on a clock with 12 paintings or 12 baskets. DALL-E generates clocks textured like baskets or spattered with paint, or with baskets next to the clock, but nothing like what I was imagining.

This experiment persuades me that DALL-E can’t count. DALL-E can compose. It understands a relationship between number-containing prompts and shapes that we recognize as digits, or between number-containing prompts and arrangements of objects that correspond to those numbers. For example, prompts that contain “nine” or “9” often yield grids, frequently 3x3 grids. Prompts that contain “eight” or “8” also often yield grids, and since grids are often 3x3, images containing 9 objects are also associated with prompts containing the word “eight.”

This pushes me somewhat toward a concept of human psychology in which our brains are composed of a large assemblage of specialized training modules for a variety of tasks. These training modules are interconnected. For example, those of us who received an education in arithmetic have it available for use in a wide variety of tasks. Learning “when to apply arithmetic” is probably also a specialized module.

This suggests to me that advanced AI will come from designing systems that can learn new, discrete tasks (addition, handling wine glasses, using tone of voice to predict what words a person will say). Such a system will then need to figure out when and how to combine them in particular contexts in order to achieve results in the world. My guess is that children do this by open-ended copying: trying to figure out some aspect of adult behavior that’s within their capabilities, and then copying it with greater fidelity, using adult feedback to guide their behavior, until they succeed and the adult signals that they’re “good enough.”

Pedagogically, this makes me suspect that even adults need to have a much greater component of blind copying when they’re trying to learn a new skill. I often have great difficulty learning new math and engineering skills until I’ve had the chance to watch somebody else work through a significant number of problems using the techniques we’ve just learned about. Even reading the descriptions in our textbooks carefully doesn’t make it “click.” That only tells me what equations to use, and what the arguments for them are. To make use of them, I have to see them being used, and sort of “think along with” the person solving them, until I’m able to predict what they’ll do, or retrace what they just did and why they did it.

Eventually, the generalized patterns underpinning their behaviors come together, and I’m able to solve novel problems.

This makes me think, then, that math and engineering students would benefit greatly from receiving large volumes of problems with step-by-step solutions. They’d “solve” these problems along with the author. Perhaps first, they’d read the author’s solution. Then they’d try to reconstruct it for themselves. Once they can solve the problem on their own, without reference to the author’s original work, they’d move on to the next problem. Eventually, they’d try solving problems on their own.