Is GPT-4.5’s ~10T parameters really a “small fraction” of the human brain’s 80B neurons and 100T synapses?
The human brain holds 200-300 trillion synapses. A 1:32 sparse MoE at high compute will need about 350 tokens/parameter to be compute optimal[1]. This gives 8T active parameters (at 250T total), 2,700T training tokens, and 2e29 FLOPs (roughly the raw compute of a GPT-6, which would need a $300bn training system on 2029 hardware).
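For concreteness, here is a rough back-of-envelope sketch of that arithmetic (my own reconstruction, not necessarily the calculation behind the quoted figures), assuming the standard C ≈ 6·N·D approximation for training FLOPs with N taken as active parameters:

```python
# Back-of-envelope check of the figures above (assumes C ~ 6*N*D with N = active params).
total_params = 250e12           # ~250T total parameters, on the order of brain synapse count
sparsity = 32                   # 1:32 sparse MoE, so ~1/32 of parameters active per token
tokens_per_active_param = 350   # compute-optimal estimate from footnote [1]

active_params = total_params / sparsity                  # ~7.8e12, i.e. ~8T active
train_tokens = tokens_per_active_param * active_params   # ~2.7e15, i.e. ~2,700T tokens
train_flops = 6 * active_params * train_tokens           # ~1.3e29, same order as the ~2e29 quoted

print(f"active params: {active_params:.1e}")
print(f"train tokens:  {train_tokens:.1e}")
print(f"train FLOPs:   {train_flops:.1e}")
```

The 6·N·D figure comes out a bit below the quoted 2e29, but within the same order of magnitude, which is all an estimate like this is meant to carry.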
There won’t be enough natural text data to train it on, even when training for many epochs. The human brain clearly doesn’t train primarily on external data (humans blind from birth still attain human intelligence), so there must exist some method for generating much more synthetic data from a small amount of external data.
I’m combining the 6x lower-than-dense data efficiency of a 1:32 sparse MoE (from a Jan 2025 paper) with the 1.5x-per-1000x-compute decrease in data efficiency from the Llama 3 compute-optimal scaling experiments, anchoring to Llama 3’s 40 tokens/parameter for a dense model at 4e25 FLOPs. Thus 40 × 6 × 1.5, about 350. This is tokens per active parameter, not total.
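Spelled out numerically (again, just a reconstruction of the footnote’s stated steps; the variable names are mine):

```python
# The footnote's three factors multiplied out.
llama3_tokens_per_param = 40   # dense compute-optimal anchor at 4e25 FLOPs (Llama 3 experiments)
moe_data_penalty = 6           # 1:32 sparse MoE needs ~6x more data than dense (Jan 2025 paper)
compute_scaling_factor = 1.5   # ~1.5x more tokens/param per ~1000x increase in compute

tokens_per_active_param = llama3_tokens_per_param * moe_data_penalty * compute_scaling_factor
print(tokens_per_active_param)  # 360.0, rounded to "about 350" in the text
```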
Isn’t it fairly obvious that the human brain starts with a lot of pretraining built in by evolution? I know some people argue that the human genome does not contain nearly enough data to make up for the lack of subsequent training data, but I don’t have a good intuition for how data efficient an LLM would turn out to be if it could train on a limited amount of real-world data plus synthetic reasoning traces from a tiny teacher model that has been heavily optimised with massive data and compute (as the genome has). I also don’t think we could actually reconstruct a human from the genome alone (I expect that transferring the nucleus of a fertilised human egg into, say, a chimpanzee ovum and trying to gestate it in the womb of some suitable mammal would already fail for incompatibility reasons), so the cellular machinery that runs the genome probably carries a large amount of information beyond the genome itself, in the sense that we need that exact machinery to run it.
In many other species it is certainly the case that much of the intelligence of the animal seems hardwired genetically. The speed at which some animal acquires certain skills therefore does not tell us too much about the existence of efficient algorithms to learn the same behaviours from little data starting from scratch.
I think parts of the brain are non-pretrained learning algorithms, and parts of the brain are not learning algorithms at all, but rather innate reflexes and such. See my post “Learning from scratch in the brain” for justification.
My view is that all innate reflexes are a form of software operating on the organic Turing machine that is our body. For more on this, see the thinking of Michael Levin and Joscha Bach.
I came up with my estimate of one to four orders of magnitude via some quick search results, so I’m very open to revision. But indeed, the possibility that GPT-4.5 is about 10% of the human brain was within the window I was calling a “small fraction”, which is maybe a misleading use of language. My main point is that if a human were born with 10% (or less) of the normal amount of brain tissue, we might expect them to have a learning disability that qualitatively impacted the sorts of generalizations they could make.
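As an illustration of where that ~10% sits (purely a parameter-to-synapse ratio, with all the caveats about that comparison noted below):

```python
# Illustrative only: ratio of a ~10T-parameter model to human synapse-count estimates.
import math

gpt45_params = 10e12                          # ~10T parameters, the figure floated in the question
synapse_estimates = (100e12, 200e12, 300e12)  # 100-300T synapses, per the estimates above

for synapses in synapse_estimates:
    ratio = gpt45_params / synapses
    ooms = math.log10(synapses / gpt45_params)
    print(f"{synapses:.0e} synapses -> {ratio:.1%} ({ooms:.1f} OOM smaller)")
```

By this crude measure GPT-4.5 sits around one to one-and-a-half orders of magnitude below the brain, i.e. at the small end of the one-to-four orders of magnitude window.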
Of course, comparison of parameter-counts to biological brain sizes is somewhat fraught.