I think that, even before the release of GPT-5 and setting aside Grok 4’s problems, I have a weak case that non-neuralese AI progress is unlikely to be fast. Recall the METR measurements.
The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 min, Mar ’23) and GPT-4o (9 min, May ’24).
Evaluation of Chinese models shows DeepSeek’s time horizon[2] changing only from 18 to 31 minutes between[3] V3 (Dec ’24) and R1-0528 (May ’25).
While Grok 4 was likely trained incompetently[4] and/or for the benchmarks, its 50% time horizon is 1.83 hrs (vs. o3’s 1.54 hrs) and its 80% time horizon is 15 min (vs. o3’s 20 min). In other words, Grok 4’s performance is comparable to that of o3.
Taken together, the two plateaus and Grok 4’s failure suggest a troubling pattern: creating an AGI is likely to require[5] neuralese, which will likely prevent humans from noticing misalignment.
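For reference, a model’s 50% (or 80%) time horizon is roughly the human task length at which the model succeeds 50% (or 80%) of the time, read off a fit of success probability against task length. Below is an illustrative sketch of that computation, using made-up per-task data and a logistic fit on log task length; it approximates METR’s methodology but is not their actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up per-task results: task length (minutes for a human) and whether the model succeeded.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1,  1,  1,  0,   1,   0,   0])

# Fit success probability against log task length (roughly METR's approach).
X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

def time_horizon(p: float) -> float:
    """Task length (in minutes) at which the fitted success probability equals p."""
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    # Solve b0 + b1 * log(t) = logit(p) for t.
    return float(np.exp((np.log(p / (1 - p)) - b0) / b1))

print(f"50% time horizon: {time_horizon(0.5):.0f} min")
print(f"80% time horizon: {time_horizon(0.8):.0f} min")
```

One consequence of this setup: the flatter the success-vs-length curve, the wider the gap between the 50% and 80% horizons, which matches the Grok 4 vs. o3 numbers above and the “less time-dependent” speculation below.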
GPT-4.5 does have a time horizon between 30 and 40 minutes, but unlike GPT-4o it was a MoE and was trained on CoTs.
Alas, METR’s evaluation of DeepSeek’s capabilities might have missed “agent scaffolds which could elicit the capabilities of the evaluated models much more effectively”. If there exists an alternate scaffold where R1-0528 becomes a capable agent and V3 doesn’t, then DeepSeek’s models are not on a plateau.
In addition, DeepSeek V3, released in December, didn’t use a CoT. If the main ingredient necessary for a capabilities increase is MoE, not CoT, then what can be said about Kimi K2?
Grok 4 could have also been deliberately trained on complex tasks, which might have made the success rate less time-dependent. After all, it did reach 16% on the ARC-AGI-2 benchmark.
There is, however, Knight Lee’s proposal of creating many agents that have access to each other’s CoTs and work in parallel. While Grok 4 Heavy could be a step in this direction, its agents receive access to each other’s CoTs only after they finish the work.
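To make the distinction concrete, here is a toy Python sketch of Lee-style sharing, where agents publish partial CoTs that their peers can read mid-run, contrasted in the comments with after-the-fact sharing. The `call_llm` function is a hypothetical stand-in for a real model call; this is purely illustrative, not an implementation of Grok 4 Heavy or of Lee’s proposal.

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    await asyncio.sleep(0.1)  # pretend to think
    return f"next step, given: {prompt[:60]}..."

async def agent(name: str, task: str, shared_cots: dict[str, list[str]], steps: int = 3) -> list[str]:
    for _ in range(steps):
        # Read everyone else's *partial* CoTs while still working (Lee-style sharing).
        peer_thoughts = "\n".join(
            t for peer, thoughts in shared_cots.items() if peer != name for t in thoughts
        )
        thought = await call_llm(f"{task}\nPeer thoughts so far:\n{peer_thoughts}")
        shared_cots[name].append(thought)  # publish own partial CoT immediately
    return shared_cots[name]

async def main() -> None:
    shared_cots: dict[str, list[str]] = {"A": [], "B": [], "C": []}
    results = await asyncio.gather(*(agent(n, "solve the task", shared_cots) for n in shared_cots))
    # In the after-the-fact variant (closer to how Grok 4 Heavy is described above),
    # each agent would run with empty peer_thoughts, and the CoTs would be merged
    # only here, after every agent has finished.
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```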