I think that, even before the release of GPT-5 and setting aside Grok 4’s problems, I have a weak case that non-neuralese AI progress is unlikely to be fast. Recall the METR measurements.
The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 min, Mar ’23) and GPT-4o (9 min, May ’24).
Evaluation of Chinese models shows DeepSeek’s time horizon[2] changing only from 18 to 31 minutes between[3] V3 (Dec ’24) and R1-0528 (May ’25).
While Grok 4 was likely trained incompetently[4] and/or for the benchmarks, its 50% time horizon is 1.83 hrs (vs. o3’s 1.54 hrs) and its 80% time horizon is 15 min (vs. o3’s 20 min). In other words, Grok 4’s performance is comparable to that of o3.
Taken together, the two plateaus and Grok 4’s failure suggest a troubling pattern: creating an AGI is likely to require[5] neuralese, which will likely prevent humans from noticing misalignment.
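For reference, a model’s 50% (or 80%) time horizon is roughly the human task length at which the model succeeds 50% (or 80%) of the time, read off a fit of success probability against task length. Below is an illustrative sketch of that computation, using made-up per-task data and a logistic fit on log task length; it approximates METR’s methodology but is not their actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up per-task results: task length (minutes for a human) and whether the model succeeded.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1,  1,  1,  0,   1,   0,   0])

# Fit success probability against log task length (roughly METR's approach).
X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

def time_horizon(p: float) -> float:
    """Task length (in minutes) at which the fitted success probability equals p."""
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    # Solve b0 + b1 * log(t) = logit(p) for t.
    return float(np.exp((np.log(p / (1 - p)) - b0) / b1))

print(f"50% time horizon: {time_horizon(0.5):.0f} min")
print(f"80% time horizon: {time_horizon(0.8):.0f} min")
```

One consequence of this setup: the flatter the success-vs-length curve, the wider the gap between the 50% and 80% horizons, which matches the Grok 4 vs. o3 numbers above and the “less time-dependent” speculation below.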
GPT-4.5 does have a time horizon between 30 and 40 minutes, but unlike GPT-4o it was a MoE and was trained on CoTs.
Alas, METR’s evaluation of DeepSeek’s capabilities might have missed “agent scaffolds which could elicit the capabilities of the evaluated models much more effectively”. If there exists an alternate scaffold where R1-0528 becomes a capable agent and V3 doesn’t, then DeepSeek’s models are not on a plateau.
In addition, DeepSeek V3, released in December, didn’t use a CoT. If the main ingredient necessary for a capabilities increase is MoE, not CoT, then what can be said about Kimi K2?
Grok 4 could have also been deliberately trained on complex tasks, which might have made the success rate less time-dependent. After all, it did reach 16% on the ARC-AGI-2 benchmark.
There is, however, Knight Lee’s proposal of creating many agents that have access to each other’s CoTs and work in parallel. While Grok 4 Heavy could be a step in this direction, its agents receive access to each other’s CoTs only after they finish the work.
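To make the distinction concrete, here is a toy Python sketch of Lee-style sharing, where agents publish partial CoTs that their peers can read mid-run, contrasted in the comments with after-the-fact sharing. The `call_llm` function is a hypothetical stand-in for a real model call; this is purely illustrative, not an implementation of Grok 4 Heavy or of Lee’s proposal.

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    await asyncio.sleep(0.1)  # pretend to think
    return f"next step, given: {prompt[:60]}..."

async def agent(name: str, task: str, shared_cots: dict[str, list[str]], steps: int = 3) -> list[str]:
    for _ in range(steps):
        # Read everyone else's *partial* CoTs while still working (Lee-style sharing).
        peer_thoughts = "\n".join(
            t for peer, thoughts in shared_cots.items() if peer != name for t in thoughts
        )
        thought = await call_llm(f"{task}\nPeer thoughts so far:\n{peer_thoughts}")
        shared_cots[name].append(thought)  # publish own partial CoT immediately
    return shared_cots[name]

async def main() -> None:
    shared_cots: dict[str, list[str]] = {"A": [], "B": [], "C": []}
    results = await asyncio.gather(*(agent(n, "solve the task", shared_cots) for n in shared_cots))
    # In the after-the-fact variant (closer to how Grok 4 Heavy is described above),
    # each agent would run with empty peer_thoughts, and the CoTs would be merged
    # only here, after every agent has finished.
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```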