This is cool! I think I’m updating toward the logistic fit not mattering. The question I have now is: what would it have taken on this underlying data for the log-linear trend not to hold? My guess is models not making progress for months and staying at similar aggregate accuracy (with success rates staying roughly inversely correlated with task length).
shash42
The mean estimate of 50% success horizon length (headline number METR reports) went from ~1 to ~4 hours. The progress within the hour subranges is difficult to draw much information from, given the low number of data points, and distribution biases in topics. This is the precise claim of the new post I made, and linked :)
Thanks for checking this. Log-linear isn’t that different from logistic in how it would affect the downstream prediction. Could you (someone at METR) update the public all-results file on GitHub so we can play around with this data?
I am particularly curious to know what would happen if we took the 50% horizon to be the start point of the first bar where the model drops below 50% accuracy. This increases uncertainty, but it would be interesting to see what trend comes out, and how model rankings change (is Opus 4.5 a big update?).
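A minimal sketch of the alternative horizon definition suggested above, assuming success rates have already been aggregated into task-length bars. The function name, bar boundaries, and success rates are made up for illustration and are not METR data.

```python
from typing import Optional, Sequence


def first_drop_horizon(
    bar_starts_minutes: Sequence[float],
    success_rates: Sequence[float],
    threshold: float = 0.5,
) -> Optional[float]:
    """Return the start of the first task-length bar with success below threshold.

    Bars are assumed to be sorted by increasing task length. Returns None when
    the model stays at or above the threshold on every bar.
    """
    for start, rate in zip(bar_starts_minutes, success_rates):
        if rate < threshold:
            return start
    return None


# Hypothetical bars starting at 1, 4, 15, 60 and 240 minutes.
bar_starts = [1, 4, 15, 60, 240]
model_rates = [0.95, 0.90, 0.70, 0.45, 0.20]
horizon = first_drop_horizon(bar_starts, model_rates)
print(horizon)
```

Because the result can only take the values of the bar boundaries, this definition is much coarser than a fitted curve, which is where the extra uncertainty comes from.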
I do expect it would still be an exponential trend, and agree with you that the underlying data distribution (specifically the topics aligning exactly with frontier lab priorities) is the riskier confounder. Although one could argue for choosing to do it this way, it just reduces the chances of the horizon length being relevant outside the model’s strongest areas.
That’s an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, it would keep making the line flatter? That kind of makes me wonder, once again, if it’s even a good idea to try and fit a line here...
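A quick way to check the flattening effect is a toy least-squares fit. This is only a sketch on made-up success rates, using an ordinary straight-line fit of success rate against log2 task length; it is not METR's fitting procedure or data.

```python
import numpy as np


def fitted_slope(log2_minutes, success_rates):
    """Slope of an ordinary least-squares line through (log2 length, success)."""
    slope, _intercept = np.polyfit(log2_minutes, success_rates, deg=1)
    return slope


# Made-up success rates for tasks of 1, 2, 4, 8, 16 and 32 minutes.
x = np.arange(6, dtype=float)
y = np.array([1.0, 0.9, 0.7, 0.5, 0.2, 0.0])

# Append five longer task lengths on which the model always fails.
x_extended = np.arange(11, dtype=float)
y_extended = np.concatenate([y, np.zeros(5)])

slope_original = fitted_slope(x, y)
slope_extended = fitted_slope(x_extended, y_extended)
print(slope_original, slope_extended)
```

On this toy data the slope roughly halves in magnitude once the all-failure points are added, even though nothing about the model's behaviour on the original tasks changed.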
Thanks, I should’ve done that myself instead of lazily mentioning what it “looked like”. R^2=0.51 is still a lot lower than the initial 0.83. Though same as before, I am not fully sure what this implies for the logistic model chosen, and downstream conclusions.
https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot
Claude’s performance is low on the 2-4 hour range, which mostly consists of cybersecurity tasks, potentially dual-use for safety. In general, training on cybersecurity CTFs and ML code would increase “horizon length” on the METR plot, which only has 14 samples in the relevant (1-4 hour) range where progress happened in 2025.
It’s a linkpost for https://arxiv.org/abs/2509.09677
I did make it a linkpost, not sure if just adding a summary isn’t traditional?
Thanks, these are some great ideas. Another thing you guys might want to look into is shifting away from MCQs towards answer-matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm
Yes, that is a good takeaway!
Hey, thanks for checking. The Qwen2.5 MATH results are on the full MATH dataset, so they are not comparable here, as the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper has the results on MATH500, so that is why we took it from there.
I do agree that ideally we should re-evaluate all models on the same more reliable evaluation setup. However, to the best of our knowledge, the papers have not released open-weight checkpoints. The most transparent way to fix all these issues is for papers to release sample-level outputs going forward, so it’s easy for people to figure out what’s going on.
All this said, in the end, our main point is only: If changing inference hyperparameters can give higher accuracy, is RL improving “reasoning abilities”?
Plug, but model mistakes have been getting more similar as capabilities increase. This also means that these correlated failures appearing now will go away together.
Various baselines have long been underrated in Interp literature, and now that we are re-realizing their importance, I’ll bring up some results we found in MATS′23 that in hindsight should probably have received more attention: https://www.lesswrong.com/posts/JCgs7jGEvritqFLfR/evaluating-hidden-directions-on-the-utility-dataset
We found that linear probes are great for classification, but they mostly fit spurious correlations, which can still be good if prediction is the end goal, such as when trying to identify deception. However, the directions found by a linear probe can’t be steered or ablated.
What works really well (and is still under-explored) is the vector obtained by subtracting the class means, both for causal steering and for classification. LEACE showed that this is in fact the optimal linear erasure theoretically.
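For concreteness, a minimal sketch of the class-mean difference direction on toy data. The function names, the 2-D "activations", and the threshold are hypothetical and only illustrate the idea; this is not the code used for the linked results.

```python
import numpy as np


def mean_difference_direction(activations, labels):
    """Unit vector pointing from the class-0 mean to the class-1 mean."""
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    diff = activations[labels == 1].mean(axis=0) - activations[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)


def classify(activations, direction, threshold):
    """Predict class 1 when the projection onto the direction exceeds threshold."""
    return (np.asarray(activations, dtype=float) @ direction > threshold).astype(int)


def steer(activations, direction, alpha):
    """Shift every activation by alpha along the direction."""
    return np.asarray(activations, dtype=float) + alpha * direction


# Toy 2-D "activations": the classes differ only along the first axis.
acts = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
labels = np.array([0, 0, 1, 1])

direction = mean_difference_direction(acts, labels)
predictions = classify(acts, direction, threshold=1.0)
steered = steer(acts, direction, alpha=2.0)
```

The same vector is used both to read out the class (projection plus threshold) and to intervene (adding a multiple of it), which is what makes it convenient compared with a probe direction that is only trained for prediction.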
Unsupervised methods (like PCA) are less good at prediction but still quite good for causal interventions.
These results only got published as Figure 12 of the Representation Engineering paper, though in hindsight maybe it would have helped to highlight them more prominently as a paper of their own. SAEs were just being ‘discovered’ around this time (I remember @Hoagy was working on this at MATS′23) so we didn’t benchmark them unfortunately.
Thanks! I fixed the last paragraph accordingly. I indeed wanted to say faster-than-linearly for the highest feasible k.
These results empirically resolve for me why scaling will continue to be economically rational despite logarithmic gains in (many) benchmark performance. https://www.lesswrong.com/posts/dAYemKXz4JDFQk8QE/log-linear-scaling-is-worth-the-cost-due-to-gains-in-long
That makes sense. My higher level concern with gradient routing (to some extent true for any other safety method) being used throughout training rather than after training is alignment tax, where it might lead to significantly lower performance and not get adopted in frontier models.
Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on because it’s always better to train a combined model which allows optimal sharing of parameters.
It’s still a cool idea though, and I would be happy to see it work out :)
[1] Andreas, Jacob et al., “Neural Module Networks.”, CVPR 2016
[2] Ebrahimi, Sayna, et al. “Adversarial continual learning.” ECCV 2020
Thanks for pointing me to Figure 12, it alleviates my concern! I don’t fully agree with RMU being a stand-in for ascent-based methods. Targeted representation noising (as done in RMU) seems easier to reverse than loss maximization methods (like TAR). Finally, just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods rather than finetuning-based. What I meant to say was that adding some retain set finetuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.
Thanks for sharing these interesting results!
I am a big fan of reporting unlearning results across identified forget set fractions! That said, I think the unlearning results lack comparisons to important ablations/baselines which would really test if gradient routing is adding value. For example:
1. CF (catastrophic forgetting) - This would involve removing most components of ERA, only keeping the finetuning on the retain set.
2. Ascent + CF - This would involve a light touch of gradient ascent (maximizing the loss) on the forget set, with simultaneous finetuning on the retain set. See [1] or AC↯DC in [2] for good implementations.
3. Methods that combine these concepts specifically for LLMs, like LLMU [3]
Without these, it is difficult to know if gradient routing is actually adding any value on top of what can be achieved with traditional finetuning.
Also, the SSD method has been shown to perform well on the setup of partial deletion sets [4], so another thing to check would be comparing Potion (a followup to SSD) [5] + finetuning on the retain set, which would stress-test the hypothesis of “we need gradient routing through a new subnetwork instead of just finding the relevant parts of the existing network”.
[1] Trippa, Daniel, et al. “$\nabla \tau$: Gradient-based and Task-Agnostic machine Unlearning.” CoRR 2024
[2] Kolipaka, Varshita, et al. “A Cognac shot to forget bad memories: Corrective Unlearning in GNNs.” arXiv preprint arXiv:2412.00789 (2024).
[3] Yao, Yuanshun, Xiaojun Xu, and Yang Liu. “Large language model unlearning.” arXiv preprint arXiv:2310.10683 (2023).
[4] Goel, Shashwat, et al. “Corrective machine unlearning.” TMLR 2024
[5] Schoepf, Stefan, Jack Foster, and Alexandra Brintrup. “Potion: Towards Poison Unlearning.” DMLR Journal 2024
Thanks, the rationale for using PCA was quite interesting. I also quite like the idea of separating different model classes for this evaluation.
I think it’s interesting to see how much improvements in different types of safety benchmarks correlate with advancement in model capabilities. I also agree that designing decorrelated benchmarks is important, simply because it indicates they won’t be saturated as easily. However, I have some doubts regarding the methodology and would appreciate clarifications if I misinterpreted something:
Using model performance based correlation: If I’m not wrong, the correlation of capability and safety is measured using the performance of various models on benchmarks. This seems more a metric of how AI progress has been in the past, rather than saying much about the benchmark itself. It’s quite possible that models with more capabilities also have more safety interventions (as they came out later, where presumably more safety research had been used), and that’s why there is a correlation. On the flipside, if future model releases apply weapon-risk reduction techniques like unlearning, then those benchmarks will also start showing a positive correlation. Thus, I’m not sure if this methodology provides robust insights for judging benchmarks. Further, it can also be gamed (artificially lower correlation) by strategically picking more models with safety interventions applied.
Projection on principal component: Why is comparing the projection on the first principal component of the Capability/Safety_benchmark x Model matrix preferable to just comparing average accuracies across capabilities and safety benchmarks?
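To make the second question concrete, here is a rough sketch of the two summaries being compared, on a made-up accuracy matrix. The function name, the standardization step, and the numbers are assumptions for illustration; the paper's exact procedure may differ.

```python
import numpy as np


def first_pc_scores(scores):
    """Project each model (row) onto the first principal component.

    Columns (benchmarks) are standardized first, so every benchmark
    contributes on the same scale.
    """
    scores = np.asarray(scores, dtype=float)
    standardized = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Rows of vt are the principal directions of the standardized matrix.
    _u, _s, vt = np.linalg.svd(standardized, full_matrices=False)
    return standardized @ vt[0]


# Made-up accuracies: 4 models (rows) x 3 capability benchmarks (columns).
accuracies = np.array([
    [0.20, 0.30, 0.25],
    [0.40, 0.50, 0.45],
    [0.60, 0.65, 0.70],
    [0.80, 0.90, 0.85],
])

pc1 = first_pc_scores(accuracies)
average = accuracies.mean(axis=1)
# The sign of a principal component is arbitrary, so compare magnitudes.
agreement = abs(np.corrcoef(pc1, average)[0, 1])
print(agreement)
```

When the benchmarks are strongly correlated with each other, as in this toy matrix, the first-component score and the plain average rank the models almost identically, which is why the choice between them needs a justification.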
That’s quite possible. I’m not sure how much that plays out with reinforcement learning training though.