Nobody knows what amount of compute is sufficient for AGI, in the sense of capability for mostly autonomous research, especially with some algorithmic improvements.
This is what I find really puzzling. The human brain, which crossed the sapience threshold only a quarter-million years ago, has O(10^14) synapses, and presumably a lot of evolved, genetically-determined inductive biases. Synapses have very sparse connectivity, so synapse counts should presumably be compared to parameter counts after sparsification, which tends to reduce them by 1-2 orders of magnitude. GPT-4 is believed to have O(10^12) parameters; it's an MoE model, so it has some sparsity and some duplication: call that O(10^10 or 10^11) for a comparable number. So GPT-4 is showing "sparks of AGI" something like 3 or 4 orders of magnitude before we would expect AGI from the biological parallel. I find that astonishingly low. Bear in mind also that a human brain only needs to implement one human mind, whereas an LLM is trying to learn to simulate every human who's ever written material on the Internet in any high- or medium-resource language, a clearly harder problem.
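The arithmetic behind that "3 or 4 orders of magnitude" can be sketched as follows; all figures are the rough order-of-magnitude assumptions from the paragraph above (rumoured parameter counts, a 1-2 order-of-magnitude sparsification discount), not measurements:

```python
import math

brain_synapses = 1e14   # human brain, O(10^14) synapses
gpt4_params = 1e12      # rumoured GPT-4 parameter count, O(10^12)

# Biological synapses are very sparsely connected, so compare against
# sparsified parameter counts; sparsification (plus MoE sparsity and
# duplication) plausibly cuts 1-2 orders of magnitude.
for reduction in (1e1, 1e2):
    effective = gpt4_params / reduction           # O(10^11) or O(10^10)
    gap = math.log10(brain_synapses / effective)
    print(f"effective params ~1e{math.log10(effective):.0f}: "
          f"~{gap:.0f} orders of magnitude below the brain")
```

With a 1 order-of-magnitude discount the gap is ~3 orders of magnitude; with 2 it is ~4, matching the range quoted above.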
I don't know if this is evidence that AGI is a lot easier than humans make it look, or a lot harder than GPT-4 makes it look. Maybe controlling a real human body is an incredibly compute-intensive task (but then I'm pretty sure that less than 90% of the human brain's synapses are devoted to motor control and controlling the internal organs, so more than 10% are used for language/visual processing, reasoning, memory, and executive function). Possibly we're mostly still fine-tuned for something other than being an AGI? Given the implications for timelines, I'd really like to know.
I had a thought. When comparing LLM parameter counts to synapse counts, for parity the parameter count of each attention head should be multiplied by the number of locations it can attend to, or at least by the logarithm of that number. That would account for about an order of magnitude of the disparity, narrowing it to 2-3 orders of magnitude. That sounds rather more plausible for the gap from sparks of AGI to full AGI.
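A minimal sketch of that adjustment, using hypothetical GPT-3-scale architecture numbers (width, layer count, and context length are illustrative assumptions, not GPT-4's actual configuration):

```python
import math

# Hypothetical, GPT-3-like architecture numbers for illustration only.
d_model = 12288    # model width
n_layers = 96      # transformer layers
context = 8192     # locations each head can attend to (context length)

# Per layer, the attention projections (Q, K, V, output) hold
# roughly 4 * d_model^2 parameters.
attn_params = n_layers * 4 * d_model ** 2

# Crediting each attention parameter with the logarithm of the number
# of attendable locations:
credit = math.log2(context)     # = 13 for an 8192-token context
print(f"log2(context) = {credit:.0f}, "
      f"i.e. roughly one extra order of magnitude for attention parameters")
```

A multiplier of ~13 on the attention parameters is about one order of magnitude, which is where the "account for about an order of magnitude" figure above comes from; using the raw location count instead of its logarithm would close the gap further still.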