It looks as if the scaling laws of various benchmarks tend to be multilinear (piecewise linear):
The METR benchmark, which measures the length of tasks (in human time) that models can complete, scaled linearly at first; then RL produced an acceleration; then the scaling law of ln(length) per ln(compute spent on RL) arguably[1] forced progress to slow down, since Grok 4 already spent equal amounts of compute on RL and on pretraining;
The ARC-AGI-1 benchmark had o4-mini, o3, and GPT-5 perform on a nearly straight line, on which Claude's better results also reside;
Similarly, the benchmark's Pareto frontier before the cluster around GPT-5 (high) has become a nearly straight line: GPT-5 Nano (minimal) – Qwen3-235B-A22B Instruct (25/07) – three GPT-5 Mini points – ARChitects – GPT-5 (high);
LLMs have also formed a line: GPT-5 (high) – Grok 4 – GPT-5 Pro – o3-preview (low);
The slope of the line formed by Pang's and Berman's agents is close to that of the line formed by the high-cost LLMs;
Next is the ARC-AGI-2 benchmark. While the low-cost LLMs form no straight line, the high-cost LLMs have reached a straight line through Claude Sonnet 4.5, Grok 4, and GPT-5 Pro;
And Pang's and Berman's agents have reached similar slopes.
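The claim that two groups of points share "similar inclinations" on a log-log cost/score plot can be checked mechanically: fit a line to ln(score) versus ln(cost) for each group and compare the fitted slopes. The sketch below does this with hypothetical (cost, score) points standing in for the leaderboard data; the actual numbers live on the ARC-AGI leaderboards.

```python
import numpy as np

# Hypothetical (cost, score) points for two model families on a log-log plot;
# these stand in for the real ARC-AGI leaderboard values.
high_cost_llms = np.array([[20.0, 10.0], [60.0, 16.0], [200.0, 27.0]])
agents = np.array([[5.0, 8.0], [15.0, 13.0], [50.0, 21.0]])

def loglog_slope(points):
    """Fit ln(score) = a * ln(cost) + b by least squares; return the slope a."""
    x, y = np.log(points[:, 0]), np.log(points[:, 1])
    a, _b = np.polyfit(x, y, 1)
    return a

slope_llms = loglog_slope(high_cost_llms)
slope_agents = loglog_slope(agents)
# "Similar inclination" then just means the two fitted slopes are close.
print(slope_llms, slope_agents)
```

A shared slope on a log-log plot corresponds to the same power-law exponent relating score to cost, which is what makes the parallel lines in the benchmark plots noteworthy.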
EDIT: added two links to images illustrating the patterns on the two ARC-AGI benchmarks.
While GPT-5's horizon of 137 minutes continued the slower trend since o3, the result may reflect spurious failures, without which GPT-5 could have reached a horizon of 161 minutes, almost on par with Greenblatt's prediction.
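To see how spurious failures can depress a horizon estimate: a METR-style time horizon is obtained by fitting a logistic curve of success probability against log task length and reading off the 50% crossing, so relabeling even one spurious failure shifts the estimate upward. Below is a minimal sketch of that kind of fit with made-up success/failure data (the real estimates use METR's task suite and methodology).

```python
import numpy as np

# Hypothetical results: task length in minutes, and 1 = success, 0 = failure.
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def fit_horizon(lengths, success, steps=20000, lr=0.1):
    """Logistic regression of success on x = ln(length), fitted by plain
    gradient descent; the horizon is the length where p = 0.5."""
    x = np.log(lengths)
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= lr * np.mean((p - success) * x)
        b -= lr * np.mean(p - success)
    return float(np.exp(-b / w))  # p = 0.5 where w*x + b = 0

horizon = fit_horizon(lengths, success)
# Relabeling the one spurious failure (at 60 min) as a success
# pushes the estimated horizon upward:
success_fixed = success.copy()
success_fixed[6] = 1.0
print(horizon, fit_horizon(lengths, success_fixed))
```

This is only meant to illustrate the mechanism behind the 137-versus-161-minute gap, not to reproduce either number.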