Estimating Brain-Equivalent Compute from Image Recognition Algorithms

[Epistemic Status: Playing around with the idea of a benchmark with some rough numbers.]

When I read Biological Anchors: A Trick That Might Or Might Not Work, my thinking was: biological anchors will work if our algorithms are close enough to what the brain does, because they can then be used to estimate the compute (FLOPs) needed for the rest of the brain. The compute equivalent of the brain has been discussed recently here (I think this indicates algorithms that are a factor of 100 more efficient) and here. I used this approach for predictions on Metaculus. It will not give you sharp bounds, nor tell you whether algorithms could do things much more cheaply, nor which algorithms to use. I have not seen this specific comparison elsewhere.

This started with the idea that we might already have algorithms that perform as well as some parts of the brain, so we can compare their costs, power requirements, and complexity. Specifically, state-of-the-art image recognition is about as good as human raters. Thus, let's compare state-of-the-art image recognition algorithms with the corresponding brain region (the visual cortex) and then extrapolate to the whole brain.

I did this and here is the result:

| | Brain region: Visual Cortex V1 (Brodmann Area 17) | Algorithm: CoAtNet-7 |
|---|---|---|
| Volume [cm^3] | 11 (1% of brain volume) | |
| Neurons / parameters [10^6] | 280 neurons (0.3% of brain neurons) | 2500 parameters |
| Power [W] | 0.18 | 13 at 10 inferences/s (assuming 2 TFLOPs/W) |
| Training compute [10^21 FLOP] | | 200 |
| Inference compute [10^9 FLOP] | | 2600 |
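
As a quick consistency check, the algorithm's power figure in the table follows from the inference cost and rate together with the assumed accelerator efficiency of 2 TFLOPs/W (the efficiency figure is an assumption about the hardware, not a measured CoAtNet-7 property). A minimal sketch:

```python
# Napkin math behind the power row of the table above.
inference_flop = 2600e9   # FLOP per CoAtNet-7 inference (from the table)
inference_rate = 10       # assumed inferences per second
flop_per_watt = 2e12      # assumed hardware efficiency: 2 TFLOPs/W

power_watt = inference_flop * inference_rate / flop_per_watt
print(f"Algorithm power: {power_watt:.0f} W")  # ~13 W, vs. 0.18 W for V1
```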

One could ask whether the comparison should include only region V1 or also regions V2 to V5 of the visual cortex, but the idea was to estimate conservatively and to exclude cognitive processes that current algorithms definitely don't cover.

Extrapolating the compute to the whole brain:

  • Inference: 8*10^15 FLOPs/s (86*10^9 neurons / 280*10^6 neurons * 2.6*10^12 FLOPs/inference * 10 inferences/second).

  • Training: ~1*10^17 FLOP/s (86*10^9 neurons / 280*10^6 neurons * 2*10^23 FLOPs / 18 life years ≈ 5.7*10^8 seconds); see the sketch below.
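
A minimal sketch of the whole extrapolation, using the rough values quoted above (scaling by neuron count is itself a strong assumption):

```python
# Scale V1-equivalent compute to the whole brain by neuron count.
brain_neurons = 86e9                 # neurons in the whole brain
v1_neurons = 280e6                   # neurons in V1 (Brodmann area 17)
scale = brain_neurons / v1_neurons   # ~307x

# Inference: CoAtNet-7 cost per image at an assumed 10 inferences/second.
inference_flop = 2.6e12
brain_inference = scale * inference_flop * 10
print(f"Inference: {brain_inference:.1e} FLOP/s")  # ~8.0e15

# Training: CoAtNet-7 training compute, amortized over 18 years of life.
training_flop = 2e23
seconds_18_years = 18 * 365.25 * 24 * 3600         # ~5.7e8 s
brain_training = scale * training_flop / seconds_18_years
print(f"Training: {brain_training:.1e} FLOP/s")    # ~1.1e17
```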

Pretty low compared to the numbers in Cotra’s paper.

There are just some problems: the visual cortex does much more than static image recognition of 512x512-pixel images:

  • The resolution of the processed image is much higher: 120 million rods instead of a quarter-million pixels (512x512 ≈ 2.6*10^5), a factor of roughly 460.

  • Stitching together the picture from blurred fragments (saccades).

  • Building something like a 3D model (maybe not in V1, though).

  • Inferring actions and how the scene changes over time (mostly motion; object permanence). Some of this may happen in regions V2 to V5 rather than V1.

Unfortunately, I only realized this after I had already collected most of the above data. There is algorithmic progress on many of these points (e.g., active research in vision-based action detection), but no algorithms come close to human performance on them. As an alternative, I also tried to get corresponding numbers for auditory processing, but those were harder to find, and speech recognition also hasn't reached human parity yet (consider the cocktail party effect). Thus my initial assumption, that we have a brain region fully covered by algorithms, doesn't hold up. I considered not posting this write-up but then decided that it might still be of interest to some readers.