Recent acceleration in the METR time horizon trend is plausibly explained by the scale of RLVR rapidly catching up to the scale of pretraining. It has almost caught up already for Grok 4 and possibly GPT-5, except that these were probably not yet trained on GB200 NVL72, which will add some efficiency for RLVR relative to pretraining compared with the older 8-chip servers. If the acceleration from RLVR stops soon, then, looking at the log-time plots, the cumulative effect is a one-time shift: it pushes the longer-term trend in time horizons forward by less than a year.
The recent trend in scaling of training compute will continue until 2026 (when the 1 GW datacenter campuses for a single AI company will be completed, such as the Crusoe/Oracle/OpenAI Abilene site), and will be visible in AIs trained on these systems through 2027. There is currently not enough talk about 5 GW datacenter campuses by 2028 (for a single AI company) to be confident the trend continues at the current pace past 2026, though such campuses will probably still arrive in 2029-2031. This puts the performance the older trend predicted for 2029 (a year after the 5 GW campuses, if hypothetically still on-trend), shifted 1 year forward due to RLVR (so we need to look at the prediction for 2030 instead), at 2030-2032 in reality. That is still above the 1-month threshold at a 50% success rate, but below it at an 80% success rate.
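The trend reading above can be sketched numerically. The constants here are assumptions for illustration only: METR has reported a doubling time of roughly 7 months for the 50%-success time horizon, and the ~1-hour anchor in early 2025 is my own rough reading of their plot, not a figure from this post.

```python
from datetime import date, timedelta
import math

# Illustrative extrapolation of the METR 50%-success time-horizon trend.
# All constants are assumptions for this sketch, not claims from the text.
DOUBLING_MONTHS = 7.0            # METR's reported doubling time, roughly
ANCHOR_DATE = date(2025, 3, 1)   # rough anchor point
ANCHOR_HORIZON_HOURS = 1.0       # ~1-hour horizon at the anchor date
WORK_MONTH_HOURS = 167.0         # one work-month, per METR's convention

def crossing_date(target_hours: float) -> date:
    """Date when the extrapolated horizon first reaches target_hours."""
    doublings = math.log2(target_hours / ANCHOR_HORIZON_HOURS)
    return ANCHOR_DATE + timedelta(days=doublings * DOUBLING_MONTHS * 30.44)

print(crossing_date(WORK_MONTH_HOURS))  # when the 1-work-month mark is crossed
```

Under these assumptions the no-acceleration extrapolation crosses one work-month around 2029, consistent with the older-trend reading above before the 1-year RLVR shift.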
After that, scaling probably slows down even further. Not even for the reasons Epoch gives, but because a 5 GW datacenter campus would already cost about $140bn in compute equipment alone, plus an additional $60bn to construct the buildings, cooling, and power infrastructure (another datapoint for the estimate of $12bn per GW for the datacenter without compute equipment).
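The arithmetic behind those figures is simple enough to check directly, using only the numbers quoted above:

```python
# Back-of-the-envelope cost of a 5 GW campus, from the figures in the text:
# ~$140bn of compute equipment, plus ~$12bn per GW for the datacenter itself
# (buildings, cooling, power infrastructure).
GW = 5
COMPUTE_EQUIPMENT_BN = 140
DATACENTER_BN_PER_GW = 12

datacenter_bn = GW * DATACENTER_BN_PER_GW
total_bn = COMPUTE_EQUIPMENT_BN + datacenter_bn
print(f"datacenter: ${datacenter_bn}bn, total: ${total_bn}bn")
```

That is, the $60bn infrastructure figure is exactly 5 GW at $12bn per GW, for roughly $200bn per campus all-in.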
Alas, the acceleration might be the consequence not just of scaling up RLVR, but of something else. Nine days ago I remarked that “The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 minutes, Mar ’23) and GPT-4o (9 min, May ’24),” and DeepSeek experienced a similar semi-plateau, implying that the acceleration could be driven by another, undisclosed breakthrough.
Daniel Kokotajlo also thinks that “we should have some credence on new breakthroughs e.g. neuralese,[2] online learning,[3] whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.”
There is also Gemini Diffusion, previewed around three months ago, already known to be OOMs faster[4] and likely to have interpretability problems. What if Gemini Diffusion is released to the public in about 1-3 months[5] and beats many models of a similar compute class on various benchmarks?
While GPT-4.5 has a time horizon between 30 and 40 minutes, unlike GPT-4o it was a MoE model and was trained on CoTs.
Or ideas like Knight Lee’s proposal, which would make the model more interpretable and nudgeable than neuralese while offering a smaller capabilities boost. What if Lee-like architectures are used in Agent-2-level systems and neuralese in Agent-3+?
However, I fail to understand how online learning boosts capabilities.
If diffusion models use OOMs less compute than traditional LLMs at inference, can their training runs be made similarly cheaper?
For comparison, o1 was previewed on Sep 12, 2024 and released on Dec 5, 2024, and o3 was previewed on Dec 20, 2024 and released on Apr 16, 2025, meaning a model is likely to be released 3-4 months after its preview. There is also GPT-5-thinking, which Zvi, quoting VictorTaelin, compares with o4. If that’s right, then o4-mini was released 4 months ahead of o4. o3-mini was released about a month after o3’s preview, implying that o4 could have been previewed 5 months before its release. If Gemini Diffusion is released 6 months after its preview, it will be released 3 months from now.
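As a sanity check on the 3-4 month figure, the two quoted preview-to-release gaps can be computed directly from the dates above:

```python
from datetime import date

# Preview and release dates as quoted in the text.
intervals = {
    "o1": (date(2024, 9, 12), date(2024, 12, 5)),
    "o3": (date(2024, 12, 20), date(2025, 4, 16)),
}
for name, (preview, release) in intervals.items():
    months = (release - preview).days / 30.44  # average month length
    print(f"{name}: {months:.1f} months from preview to release")
```

This gives roughly 2.8 months for o1 and 3.8 months for o3, which is where the 3-4 month estimate comes from.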