The surprising parameter efficiency of vision models

Crossposted from my personal blog.

Epistemic status: This is a short post meant to highlight something I do not yet understand and therefore a potential issue with my models. I would also be interested to hear if anybody else has a good model of this.

Why do vision (and audio) models work so well despite being so small? State-of-the-art models like Stable Diffusion and Midjourney work exceptionally well, generating near-photorealistic art and images and giving users a fair degree of control over their generations. I would estimate with a fair degree of confidence that the capabilities of these models surpass the mental imagery abilities of almost all humans (they definitely surpass mine and those of a number of people I have talked to). However, these models are also very small in terms of parameters. The original Stable Diffusion has only 890M parameters.

In terms of dataset size, image models are roughly on par with humans. The Stable Diffusion dataset is 2 billion images. Assuming that you see 10 images per second every second you are awake, and that you are awake 18 hours a day, you observe about 236 million images per year and so match Stable Diffusion's data input after roughly 8.5 years. Of course, the images you see are much more redundant, and these are highly aggressive assumptions, but being in the same OOM as a SOTA image model after a human lifetime is not insane. On the other hand, the hundreds of billions to trillions of tokens fed to LLMs are orders of magnitude beyond what a human could ever experience.
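The back-of-the-envelope arithmetic here is easy to check. This is a minimal sketch that uses only the figures assumed in the paragraph (10 images per second, 18 waking hours a day, a 2 billion image dataset); the variable names are mine.

```python
# Back-of-the-envelope check of the human vs. Stable Diffusion data comparison.
# All inputs are the assumptions stated in the text, not measurements.
IMAGES_PER_SECOND = 10
WAKING_HOURS_PER_DAY = 18
DAYS_PER_YEAR = 365
DATASET_IMAGES = 2_000_000_000

SECONDS_PER_HOUR = 3600
images_per_year = (
    IMAGES_PER_SECOND * SECONDS_PER_HOUR * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR
)
years_to_match = DATASET_IMAGES / images_per_year

print(f"images per year: {images_per_year:,}")
print(f"years to match the dataset: {years_to_match:.1f}")
```

Under these assumptions the crossover lands a bit under a decade, so changing any single assumption by a factor of a few still keeps a human lifetime within about an order of magnitude of the dataset.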

A similar surprising smallness occurs in audio models. OpenAI’s Whisper can do almost flawless audio transcription (including multilingual translation!) with just 1.6B parameters.

Let’s contrast this to the brain. Previously, I estimated that we should expect the visual cortex to have on the order of 100B parameters, if not more. The auditory cortex should be of roughly the same order of magnitude, but slightly smaller than the visual cortex. That is two orders of magnitude larger than state of the art DL models in these modalities.
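To make the size of the gap concrete, here is a small sketch that turns the quoted parameter counts into orders of magnitude. The brain figure is the rough 100B estimate from the text; since the auditory cortex is described as somewhat smaller, reusing 100B for the audio comparison is my assumption and should be read as an upper bound.

```python
import math

# Parameter counts quoted in the text. The cortex figure is a rough estimate,
# and reusing it for audio is an assumption (an upper bound at best).
VISUAL_CORTEX_PARAMS = 100e9
STABLE_DIFFUSION_PARAMS = 890e6
WHISPER_PARAMS = 1.6e9


def oom_gap(larger, smaller):
    """Gap between two quantities in orders of magnitude (log10 of their ratio)."""
    return math.log10(larger / smaller)


vision_gap = oom_gap(VISUAL_CORTEX_PARAMS, STABLE_DIFFUSION_PARAMS)
audio_gap = oom_gap(VISUAL_CORTEX_PARAMS, WHISPER_PARAMS)

print(f"vision gap: {vision_gap:.2f} OOM")
print(f"audio gap (upper bound): {audio_gap:.2f} OOM")
```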

This contrasts with state of the art language models which appear to be approximately equal to the brain in parameter count and abilities. Small (1-10B) language models are clearly inferior to the brain at producing valid text and completions as well as standard question answering and factual recall tasks. Human parity in factual knowledge is reached somewhere between GPT-2 and GPT-3. Human language abilities are still not entirely surpassed with GPT-3 (175B parameters) or GPT-4 (presumably significantly larger). This puts large language models within approximately the same order of magnitude as the human linguistic cortex.

What could be the reasons for this discrepancy? Off the top of my head I can think of several, listed below in rough order of intuitive plausibility, and it would be interesting to investigate them further. Also, if anybody has ideas or evidence either way, please send me a message.

1.) The visual cortex vs image models is not a fair comparison. The brain does lots of things image generation models can’t do, such as parsing and rendering very complex visual scenes, dealing with saccades and having two eyes, and, crucially, handling video data and moving stimuli. We haven’t fully cracked video yet, and it is plausible that doing so will require existing vision models to scale up by an OOM or two.

2.) There are specific inefficiencies in the brain’s processing of images that image models skip and that do not apply to language models. One very obvious example of this is convolutions. A CNN learns one set of convolutional filters and reuses the same weights on every tile of the image. The brain cannot share weights in this way and so must laboriously have separate neurons and synapses encode each filter at each location. Indeed, much of the processing in the retina, lateral geniculate nucleus, and even V1 appears to be taken up with extremely simple filters (such as Gabors, edge detectors, line detectors, etc.) copied over and over again for different image patches. This ‘artificially’ inflates the parameter count of the visual cortex relative to ML models, such that the visual cortex’s ‘effective parameter count’ is much smaller than it appears. However, I doubt this can be the whole story, as recent image models such as Stable Diffusion use increasingly transformer-like architectures (residual stream + attention) rather than convolutions for most of the image processing pipeline. Similarly, Whisper only has one conv block at the beginning before transitioning into an attention-based architecture.
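The weight-sharing argument in hypothesis 2 can be illustrated with a parameter count. This is a minimal sketch, not a model of the visual cortex: the layer shape (3x3 filters, 3 to 64 channels, a 64x64 output map) is hypothetical, and the "untied" case simply assumes every output position needs its own copy of the filter bank.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights in a conv layer: one filter bank reused at every position (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


def untied_params(in_channels, out_channels, kernel_size, out_height, out_width):
    """Weights if every output position stores its own copy of the filter bank."""
    per_position = conv_params(in_channels, out_channels, kernel_size)
    return per_position * out_height * out_width


# Hypothetical layer: 3x3 filters, 3 -> 64 channels, 64x64 output map.
shared = conv_params(3, 64, 3)
untied = untied_params(3, 64, 3, 64, 64)
inflation = untied // shared

print(f"shared weights: {shared:,}")
print(f"untied weights: {untied:,}")
print(f"inflation factor: {inflation:,}x")
```

The inflation factor is just the number of output positions, so for early, high-resolution layers a system without weight sharing could plausibly need thousands of times more parameters to implement the same filters.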

3.) Parameter count is the wrong way to assess diffusion models. Unlike feedforward NNs such as transformers or earlier vision models such as GANs/VAEs, diffusion models generate (and are trained) using a reasonably large number of diffusion steps to iteratively ‘decode’ an image. This process is very similar to the iterative inference via recurrence that occurs in the brain. However, unlike diffusion models, the brain supports a single feedforward amortized sweep to achieve core object recognition (otherwise your vision would be too slow to detect important things such as predators in time). It is possible that the iterative inference supported by diffusion models is more parameter efficient than a direct amortized net would be, and thus gets a saving over the brain in this way. While there are very good VAEs/GANs in existence and at scale, it may be that these need an OOM or more additional parameters to be competitive with diffusion models. Note that, in terms of computational cost, a forward pass through an amortized net is much cheaper than a generation with a diffusion network (a diffusion generation is effectively N amortized forward passes, where N is the number of diffusion steps), so comparable VAEs/GANs may actually be cheaper to run even if they are much larger.
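The compute comparison at the end of hypothesis 3 can be sketched as follows. This rests on a crude assumption of mine: that the cost of one forward pass is proportional to parameter count, which ignores resolution, architecture, and the text encoder. The 50-step sampler is a hypothetical setting, not a figure from the text.

```python
def diffusion_cost(params, steps):
    """Relative cost of one generation: one forward pass per diffusion step."""
    return params * steps


def amortized_cost(params):
    """Relative cost of one generation from a single-pass (amortized) network."""
    return params


def amortized_is_cheaper(scale, steps, base_params=890e6):
    """True if a single-pass net `scale` times larger still costs less per image."""
    return amortized_cost(base_params * scale) < diffusion_cost(base_params, steps)


STEPS = 50  # hypothetical number of diffusion steps
for scale in (1, 10, 100):
    verdict = "cheaper" if amortized_is_cheaper(scale, STEPS) else "not cheaper"
    print(f"single-pass net {scale}x larger: {verdict} than {STEPS}-step diffusion")
```

Under this assumption the break-even point is simply a single-pass network N times larger than the diffusion model, so an OOM of extra parameters can still come out ahead on compute when N is in the tens.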

4.) Our assessment of LLM abilities is wrong: existing LLMs are vastly superhuman, and GPT-2 style models are actually at human parity. This seems strongly unlikely from actually interacting with these models. On the other hand, even GPT-2 models possess a lot of arcane knowledge which is superhuman, and it may be that the very powerful cognition of these small models is smeared across such a wide range of weird internet data that it appears much weaker than ours in any specific facet. Intuitively, the idea is that a human and GPT-2 possess the same ‘cognitive/linguistic power’, but since GPT-2’s cognition is spread over a much wider data range than a human’s, its ‘linguistic power density’ is lower, and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am highly unclear whether these concepts are actually correct or a useful frame through which to view things.

5.) Language models are highly inefficient and can be made much smaller without sacrificing much performance. For whatever reason, we may just be training language models badly or doing something else wrong, and it is in fact possible to get 1 or 2 OOMs of parameter efficiency out of current language models. If this were true, it would be massive, since it would shrink a GPT-4 level model into a trivially open-sourceable and highly hackable ‘small’ LLM. For instance, GPT-4 is unlikely to be more than 1 trillion dense parameters. Two orders of magnitude would shrink it to a 10B model, approximately the same size as LLaMA 13B and smaller than GPT-NeoX-20B, which would be straightforward to run inference on even with consumer-grade cards. There is some evidence for this in reasonably large amounts of pruning being possible, but to me it seems that an actual 2 OOM shrinking is unlikely.
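To see why a 10B model would be easy to run on consumer hardware, here is a rough sketch of the memory needed for the weights alone. The bytes-per-parameter values are standard for these precisions, but the estimate ignores activations, the KV cache, and framework overhead, so real usage would be higher. The 1 trillion figure is the upper bound assumed in the text, not a known GPT-4 size.

```python
# Bytes needed to store one parameter at common inference precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params, precision):
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


GPT4_UPPER_BOUND = 1e12  # the text's assumed upper bound, not a known figure
shrunk_params = GPT4_UPPER_BOUND / 100  # two orders of magnitude smaller

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(shrunk_params, precision)
    print(f"{shrunk_params:.0e} params at {precision}: {gb:.0f} GB of weights")
```

At fp16 the weights of a 10B model come to about 20 GB, which fits on a 24 GB consumer card, and quantized versions leave considerably more headroom.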