He writes that the human brain has “1e13-1e15 spikes through synapses per second (1e14-1e15 synapses × 0.1-1 spikes per second)”. I think Joe was being overly conservative, and I feel comfortable editing this to “1e13-1e14 spikes through synapses per second”, for reasons given in footnote [9].
I agree that 1e14 synaptic spikes/second is the better median estimate, but those are highly sparse ops.
So when you say:
So I feel like 1e14 FLOP/s is a very conservative upper bound on compute requirements for AGI. And conveniently for my narrative, that number is about the same as the 8.3e13 FLOP/s that one can perform on the RTX 4090 retail gaming GPU that I mentioned in the intro.
You are missing some foundational differences in how von Neumann architecture machines (GPUs) run neural circuits vs how neuromorphic hardware (like the brain) runs neural circuits.
The 4090 can hit around 1e14 (even up to 1e15) flop/s, but only for dense matrix multiplication. The flop/s required to run a brain model using that dense matrix hardware are more like 1e17, not 1e14. The 1e14 synapses are at least 10x locally sparse in the cortex, so dense emulation requires 1e15 synapses (mostly zeroes) running at 100 Hz. The cerebellum is actually even more expensive to simulate, because of the more extreme connection sparsity there.
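Spelling that estimate out (a back-of-envelope sketch; the 10x local sparsity factor and the 100 Hz rate are just the assumptions stated above):

```python
# Back-of-envelope: cost of emulating the brain's sparse synapses
# on dense matrix-multiply hardware, using the figures above.
real_synapses = 1e14           # actual synapse count
local_sparsity_factor = 10     # cortex is ~10x locally sparse
update_rate_hz = 100           # effective update rate

dense_synapses = real_synapses * local_sparsity_factor   # ~1e15, mostly zeroes
dense_flops_needed = dense_synapses * update_rate_hz     # ~1e17 flop/s

rtx4090_dense_flops = 1e14     # rough dense-matmul throughput of one 4090
print(f"dense emulation: ~{dense_flops_needed:.0e} flop/s needed "
      f"vs ~{rtx4090_dense_flops:.0e} flop/s per 4090")
```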
But that isn’t the only performance issue. The GPU only hits its peak throughput on matrix-matrix multiplication, not on the vector-matrix multiplication that running a single brain-like model in real time actually requires. So in that sense the dense flop perf is useless, and the perf would instead be RAM bandwidth limited: at roughly 1 byte of weight bandwidth per flop, a single 1e14-synapse model needs about 1e14 bytes/s, vs the 4090’s roughly 1e12 bytes/s, so on the order of 100 4090s.
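And the bandwidth side of that arithmetic, as a sketch (the 1 byte/flop figure and ~1 TB/s per 4090 are the assumptions above):

```python
# Back-of-envelope: memory bandwidth needed to run the sparse model as
# vector-matrix products, fetching ~1 byte of weight per synaptic op.
synaptic_ops_per_s = 1e14        # ~1e14 sparse synaptic ops per second
bytes_per_op = 1                 # ~1 byte of weight traffic per op
required_bandwidth = synaptic_ops_per_s * bytes_per_op   # ~1e14 bytes/s

rtx4090_bandwidth = 1e12         # ~1 TB/s of RAM bandwidth per 4090
gpus_needed = required_bandwidth / rtx4090_bandwidth
print(f"~{gpus_needed:.0f} 4090s just to stream the weights")    # ~100
```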
Your reply seems to be “but the brain isn’t storing 1e14 bytes of information”, but as other comments point out that has little to do with the neural circuit size.
The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.
This is a general computational principle, with many specific examples: compressed neural frequency encodings of 3D scenes (NeRFs), which access/use all network parameters to decode a single point in O(N), are enormously less computationally efficient (runtime throughput, latency, etc.) than maximally sparse representations (using trees, hashtables, etc.) which approach O(log N) or O(1), but the sparse representations are enormously less compressed/compact. These tradeoffs are foundational and unavoidable.
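A toy illustration of that tradeoff (nothing NeRF-specific; just a compact representation that touches everything per query vs larger indexed ones that touch almost nothing):

```python
# Compactness vs runtime: a representation that touches all N entries per query,
# vs larger indexed structures that answer each query in O(log N) or O(1).
import bisect

N = 1_000_000
keys = list(range(N))
values = [k * k for k in keys]          # the "content" being represented

def dense_query(k):
    # compact style: every query scans the whole representation, O(N) work
    return sum(v for key, v in zip(keys, values) if key == k)

index = dict(zip(keys, values))         # hashtable: more memory, O(1) per query
def hash_query(k):
    return index[k]

def tree_query(k):
    # sorted array standing in for a tree: O(log N) per query
    return values[bisect.bisect_left(keys, k)]

assert dense_query(1234) == hash_query(1234) == tree_query(1234) == 1234 * 1234
```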
We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it’s now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).
We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it’s now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).
I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include Randy O’Reilly, Josh Tenenbaum, Jeff Hawkins, Dileep George, these people, maybe some of the Friston / FEP people, probably most of the “evolved modularity” people like Steven Pinker, and I think Kurzweil (he thought the cortex was built around hierarchical hidden Markov models, last I heard, which I don’t think are equivalent to ANNs?). And me! You’re welcome to argue that you’re right and we’re wrong (and most of that list are certainly wrong, insofar as they’re also disagreeing with each other!), but it’s not “uncontroversial”, right?
The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.
In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.
Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it. And I would disagree about the scale being the secret sauce. But we might not be able to resolve that—guess we’ll see what happens! See also footnote 16 and surrounding discussion.
I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include
“Uncontroversial” was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.
And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless—because the task itself determines the solution.
Examples from recent neurosci literature:
From “Brains and algorithms partially converge in natural language processing”:
Deep learning algorithms trained to predict masked words from large amount of text have recently been shown to generate activations similar to those of the human brain. However, what drives this similarity remains currently unknown. Here, we systematically compare a variety of deep language models to identify the computational principles that lead them to generate brain-like representations of sentences
From “The neural architecture of language: Integrative modeling converges on predictive processing”:
Here, we report a first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels.
From “Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain”:
We found a striking correspondence between the layer-by-layer sequence of embeddings from GPT2-XL and the temporal sequence of neural activity in language areas. In addition, we found evidence for the gradual accumulation of recurrent information along the linguistic processing hierarchy. However, we also noticed additional neural processes that took place in the brain, but not in DLMs, during the processing of surprising (unpredictable) words. These findings point to a connection between language processing in humans and DLMs where the layer-by-layer accumulation of contextual information in DLM embeddings matches the temporal dynamics of neural activity in high-order language areas.
Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it.
Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn’t lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence).
In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.
The comparison should be between GPT-3 and linguistic cortex, not the whole brain. For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of flops-equivalent it’s perhaps 1e22 sparse flops for training linguistic cortex (1e13 flop/s × 1e9 seconds) vs 3e23 flops for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
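A sketch of that training-compute comparison (the 1e13 flop/s rate and ~1e9 seconds of lifetime are the rough figures above):

```python
# Rough training-compute comparison: linguistic cortex vs GPT-3.
cortex_flops_per_s = 1e13          # assumed sparse-flop-equivalent rate
training_seconds = 1e9             # ~30 years of waking experience
cortex_training_flops = cortex_flops_per_s * training_seconds    # ~1e22

gpt3_training_flops = 3e23         # commonly cited GPT-3 training estimate
ratio = gpt3_training_flops / cortex_training_flops
print(f"cortex ~{cortex_training_flops:.0e} vs GPT-3 ~{gpt3_training_flops:.0e} "
      f"(~{ratio:.0f}x more for GPT-3)")
```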
The comparison should be between GPT-3 and linguistic cortex
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe. You would have to say that the brain’s model is inherently much much more complicated than GPT-3, such that even after putting it in this heavy-on-synapses-lite-on-FLOP format, it still takes much more FLOP to query the brain’s language model than to query GPT-3. And I don’t think that. (Although I suppose this is an area where reasonable people can disagree.)
For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task.
I don’t think energy use is important. For example, if a silicon chip takes 1000× more energy to do the same calculations as a brain, nobody would care. Indeed, I think they’d barely even notice—the electricity costs would still be much less than my local minimum wage. (20 W × 1000 × 10¢/kWh = $2/hr. Maybe a bit more after HVAC and so on.)
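Spelled out, that electricity arithmetic is just:

```python
# Electricity cost if a chip needed 1000x the brain's ~20 W for the same work.
brain_watts = 20
inefficiency_factor = 1000
price_per_kwh = 0.10                                  # dollars per kWh

power_kw = brain_watts * inefficiency_factor / 1000   # 20 kW
cost_per_hour = power_kw * price_per_kwh              # $2.00/hr
print(f"${cost_per_hour:.2f}/hr")                     # well under minimum wage
```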
I’ve noticed that you bring up energy consumption with some regularity, so I guess you must think that energy efficiency is very important, but I don’t understand why you think that.
For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data…So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
Other than Section 4, this post was about using an AGI, not training it from scratch. If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong. (Is that your argument?)
And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless—because the task itself determines the solution.
Suppose (for the sake of argument—I don’t actually believe this) that human visual cortex is literally a ConvNet. And suppose that human scientists have never had the idea of ConvNets, but they have invented fully-connected feedforward neural nets. So they set about testing the hypothesis that “visual cortex is a fully-connected feedforward neural net”. I suspect that they would find a lot of evidence that apparently confirms this hypothesis, of the same sorts that you describe. For example, similar features would be learned in similar layers. There would be some puzzling discrepancies—especially sample efficiency, and probably also the handling of weird out-of-distribution inputs—but lots of experiments would miss those. So then (in this hypothetical universe) many people would be trumpeting the conclusion: “visual cortex is a fully-connected feedforward DNN”! But they would be wrong! And the careful neuroscientists—the ones who are scrutinizing brain structures, and/or doing experiments more sophisticated than correlating activities in unmanipulated naturalistic data, etc.—would be well aware of that.
There’s a skeptical discussion about specifically LLMs-vs-brains, with some references, in the first part of Section 5 of Bowers et al.
For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.
In your analysis the brain is using perhaps 1e13 flop/s (which I don’t disagree with much), and if linguistic cortex is 10% of that we get 1e12 flop/s, or 300B flops for 0.3 seconds.
GPT-3 uses all of its nearly 200B parameters for one forward pass, but the flop count is probably 2x that (because the attention layers don’t use the long-term params), and then you are using a ‘double pass’, so closer to 800B flops for GPT-3. Perhaps the total brain is 1e14 flop/s, so 3T flops for 0.3s of linguistic cortex, but regardless it’s still using roughly the same amount of flops within our uncertainty range.
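Putting those inference-flop numbers side by side (all inputs are the rough figures above):

```python
# Rough inference-flop comparison: GPT-3 double pass vs 0.3 s of linguistic cortex.
gpt3_params = 2e11                       # "nearly 200B" parameters
flops_per_pass = 2 * gpt3_params         # ~2x params, per the estimate above
gpt3_double_pass = 2 * flops_per_pass    # ~8e11 (~800B) flops

brain_flops_per_s = 1e13                 # lower end of the 1e13-1e14 range above
cortex_fraction = 0.1                    # linguistic cortex ~10% of the brain
cortex_flops_03s = brain_flops_per_s * cortex_fraction * 0.3    # ~3e11 (~300B)

print(f"GPT-3 double pass ~{gpt3_double_pass:.0e} flops; "
      f"0.3 s of linguistic cortex ~{cortex_flops_03s:.0e} flops")
```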
However as I mentioned earlier, a model like GPT-3 running inference on a GPU is much less efficient than this, as many of the matrix-matrix mult calls (which training gets by parallelizing over time) become vector-matrix mult calls at inference and are thus RAM bandwidth limited.
So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe.
The brain’s sparsity obviously reduces the equivalent flop count vs running the same exact model on dense mat mul hardware, but interestingly enough ends up in roughly the same “flops used for inference” regime as the smaller dense model. However it is getting by with perhaps 100x less training data (for the linguistic cortex at least).
If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong.
The scaling laws indicate that performance mostly depends on net training compute, and it doesn’t matter as much as you think how you allocate that between size/params (and thus inference flops) and time/data (training steps). A larger model spends more compute per training step to learn more from less data. GPT-3 used 3e23 flops for training, whereas the linguistic cortex uses perhaps 1e21 to 1e22 (1e13 flop/s × 1e9 s), but GPT-3 trains on almost 3 OOM more equivalent token data and thus can be much smaller in proportion.
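A minimal sketch of that allocation tradeoff, assuming the standard training-compute ≈ 6 × params × tokens rule of thumb and ~3e8 “tokens” of lifetime language exposure (both of those are illustrative assumptions, not figures from the comments above):

```python
# The same training-compute budget can be spent as (big model, few tokens) or
# (small model, many tokens), using compute ~ 6 * params * tokens.
def train_flops(params, tokens):
    return 6 * params * tokens

gpt3 = train_flops(175e9, 300e9)          # ~3e23, matches the figure above

brain_budget = 1e22                       # upper end of the cortex estimate above
brain_tokens = 3e8                        # ~3 OOM less data than GPT-3 (assumed)
brain_equiv_params = brain_budget / (6 * brain_tokens)    # ~6e12 "params"

print(f"GPT-3: {gpt3:.1e} flops at 175B params / 300B tokens")
print(f"Brain-like regime: ~{brain_equiv_params:.0e} params at ~3e8 tokens "
      f"fits in ~{brain_budget:.0e} flops")
```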
So the brain is more flop efficient, but only because it’s equivalent to a much larger dense model trained on much less data.
LLMs on GPUs are heavily RAM constrained but have plentiful data, so they have naturally moved to the (smaller model, trained longer) regime relative to the brain. For the brain synapses are fairly cheap, but training data/time is not.
I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like “we found brain areas that light up when the person reads ‘cat’, just like how this part of the neural net lights up when given input ‘cat’” and less like “the LLM is useful for other tasks in the same way as the neural version is useful for other tasks”. Am I confused about what the paper says, and if so, how? What sort of claim are you making?