Will the journey from here to AGI feature “aha” moments?
It looks like it did feature such moments in the past. The METR graph that you quote had a GPT-4 to GPT-4o plateau, and all subsequent models used CoTs, longer context windows, and rapidly increasing compute spending on RL. This strategy began to crumble when Claude Opus 4 (which didn't even reach SOTA on time horizon), Grok 4, and GPT-5 failed to keep following the faster 4o-o3[1] trend.
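To make "following the faster trend" concrete, here is a minimal sketch of the underlying arithmetic: fit a log-linear trend to 50%-time-horizon measurements and check whether a later model lands below the extrapolation. The numbers are made up for illustration and are not METR's published data.

```python
import numpy as np

# Hypothetical (model, release date in years, 50% time horizon in minutes).
# Illustrative numbers only, not METR's actual measurements.
points = [
    ("GPT-4o", 2024.4, 9.0),
    ("o1", 2024.9, 39.0),
    ("o3", 2025.3, 92.0),
]
t = np.array([p[1] for p in points])
h = np.array([p[2] for p in points])

# Fit log2(horizon) = a * t + b, i.e. an exponential trend over time.
a, b = np.polyfit(t, np.log2(h), 1)
print(f"doubling time ≈ {12 / a:.1f} months")

def predicted_horizon(year: float) -> float:
    """Extrapolate the fitted exponential trend to a later release date."""
    return 2 ** (a * year + b)

# Compare a later model's measured horizon with the extrapolation
# (again, made-up numbers).
gpt5_year, gpt5_horizon = 2025.6, 137.0
print(f"trend predicts ≈ {predicted_horizon(gpt5_year):.0f} min, "
      f"measured ≈ {gpt5_horizon:.0f} min")
# If the measured horizon sits well below the prediction, the model has
# fallen off the faster trend in the sense used above.
```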
something deep about the nature of large tasks vs. small tasks, and the cognitive skills that people and LLMs bring to each.
A human brain, unlike current AIs, has a well-developed dynamic memory that is OOMs bigger (and OOMs worse trained, which forces evolution to use high learning rates) than current context windows or CoTs, let alone the number of neurons in a single layer of an LLM. What if the key to AGI lies in a similar direction?
[1] However, METR measured that trend using 4o-o1, because o3 had not yet been released. Another complication is that METR's task set is no longer as reliable as it once was, potentially causing us to underestimate the models' abilities.