This is cool! I think I’m updating toward the logistic fit not mattering. The question I have now is: what would it have taken, on this underlying data, for the log-linear trend not to hold? My guess is models not making progress for months, staying at similar aggregate accuracy (with success rates remaining roughly inversely correlated with task length).
shash42
Humans can post on moltbook
OpenForecaster: How to train language models for open-ended forecasting?
The mean estimate of the 50% success horizon length (the headline number METR reports) went from ~1 to ~4 hours. It is difficult to draw much information from the progress within the hour subranges, given the low number of data points and the distribution biases in topics. This is the precise claim of the new post I made and linked :)
Thanks for checking this. Log-linear isn’t that different from logistic in how it would affect the downstream prediction. Could you (someone at METR) update the public all-results file on GitHub so we can play around with this data?
I am particularly curious to know what would happen if we took the 50% horizon to be the start point of the first bar in which the model drops below 50% accuracy. This increases uncertainty, but it would be interesting to see what trend comes out and how model rankings change (is Opus 4.5 a big update?).
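To make the alternative concrete, here’s a minimal sketch of that “start of the first failing bar” horizon. The bins and success rates below are invented for illustration, not METR’s actual numbers, and `first_drop_horizon` is a hypothetical helper:

```python
# Toy binned data: (bin_start_minutes, bin_end_minutes, success_rate).
# These values are made up for illustration, not METR's actual results.
bins = [
    (1, 4, 0.95),
    (4, 16, 0.80),
    (16, 60, 0.60),
    (60, 240, 0.45),
    (240, 960, 0.10),
]

def first_drop_horizon(bins, threshold=0.5):
    """Alternative horizon: start of the first bin where the observed
    accuracy drops below the threshold (the definition floated above)."""
    for start, end, rate in bins:
        if rate < threshold:
            return start
    return bins[-1][1]  # model never drops below the threshold

print(first_drop_horizon(bins))  # -> 60 (minutes) for this toy data
```

Comparing rankings under this definition and under the logistic fit would show how sensitive the leaderboard is to the choice of fit.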
I do expect it would still be an exponential trend, and agree with you that the underlying data distribution (specifically the topics aligning exactly with frontier lab priorities) is the more risky confounder. Although one could argue for choosing to do it this way, it just reduces chances of the horizon length being relevant outside the model’s strongest areas.
That’s an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, would it keep making the line flatter? That kind of makes me wonder, once again, whether it’s even a good idea to try to fit a line here...
Thanks, I should’ve done that myself instead of lazily mentioning what it “looked like”. R²=0.51 is still a lot lower than the initial 0.83. Though, as before, I am not fully sure what this implies for the chosen logistic model and the downstream conclusions.
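For anyone who wants to replay this once the data file is public, here’s a self-contained sketch of the fit-and-R² mechanics, using plain gradient descent so there are no dependencies beyond numpy. All task lengths and outcomes below are made up; the real analysis would substitute METR’s per-task results:

```python
import numpy as np

# Toy per-task data: (length in minutes, success in {0,1}); invented numbers.
lengths = np.array([2., 5., 10., 30., 60., 120., 240., 480.])
succ    = np.array([1., 1., 1., 1., 0., 1., 0., 0.])

def fit_logistic(lengths, succ, lr=0.1, steps=5000):
    """Logistic regression of success on log2(task length),
    fit by plain gradient descent on the log-loss."""
    x = np.log2(lengths)
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        a -= lr * np.mean(p - succ)
        b -= lr * np.mean((p - succ) * x)
    return a, b

def r_squared(lengths, succ, a, b):
    """R^2 of the fitted success probabilities against the 0/1 outcomes."""
    x = np.log2(lengths)
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    ss_res = np.sum((succ - p) ** 2)
    ss_tot = np.sum((succ - succ.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

a, b = fit_logistic(lengths, succ)
print("slope:", b, "R^2:", r_squared(lengths, succ, a, b))
```

Appending extra long tasks with `succ = 0` to these arrays and re-fitting is then a one-line experiment for the “points to the right” question above.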
https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot
Claude’s performance is low on the 2–4 hour range, which mostly consists of cybersecurity tasks, potentially dual-use for safety. In general, training on cybersecurity CTFs and ML code would increase “horizon length” on the METR plot, which has only 14 samples in the relevant (1–4 hr) range where progress happened in 2025.
How to game the METR plot
It’s a linkpost for https://arxiv.org/abs/2509.09677
I did make it a linkpost; I’m not sure if just adding a summary isn’t traditional?
Thanks, these are some great ideas. Another thing you guys might want to look into is shifting away from MCQs towards answer-matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm
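For intuition on the difference, here’s a toy answer matcher. The linked paper uses a language-model matcher; this string-normalization version is just a crude stand-in to show the interface: the model produces a free-form answer, and matching against the reference replaces picking a letter from fixed options:

```python
import re

def normalize(ans: str) -> str:
    """Crude normalization for free-form answers: lowercase, drop
    articles and punctuation, collapse whitespace. A real evaluation
    would use an LLM matcher instead of string equality."""
    ans = ans.lower().strip()
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)  # drop articles
    ans = re.sub(r"[^\w\s]", "", ans)          # drop punctuation
    return " ".join(ans.split())

def answer_match(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

print(answer_match("The Eiffel Tower.", "eiffel tower"))  # -> True
```

Unlike an MCQ harness, nothing here depends on the distractor options, which removes the choice-elimination shortcut entirely.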
New Paper: It is time to move on from MCQs for LLM Evaluations
An Alternative Way to Forecast AGI: Counting Down Capabilities
Yes, that is a good takeaway!
Hey, thanks for checking. The Qwen2.5 MATH results are on the full MATH dataset, so they are not comparable here, as the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper has results on MATH500, which is why we took them from there.
I do agree that ideally we should re-evaluate all models on the same, more reliable evaluation setup. However, to the best of our knowledge the papers have not released open-weight checkpoints. The most transparent way to fix all these issues is for papers to release sample-level outputs going forward, so it’s easy for people to figure out what’s going on.
All this said, in the end our main point is only: if changing inference hyperparameters can give higher accuracy, is RL improving “reasoning abilities”?
Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims
Plug, but model mistakes have been getting more similar as capabilities increase. This also means that the correlated failures appearing now will go away together.
Various baselines have long been underrated in the interp literature, and now that we are re-realizing their importance, I’ll bring up some results we found in MATS ’23 that, in hindsight, should probably have received more attention: https://www.lesswrong.com/posts/JCgs7jGEvritqFLfR/evaluating-hidden-directions-on-the-utility-dataset
We found that linear probes are great for classification, but they mostly fit spurious correlations. That can still be fine if prediction is the end goal, such as when trying to identify deception; however, the directions found by a linear probe don’t work for steering or ablation.
What works really well (and is still under-explored) is the vector obtained by subtracting the class means, both for causal steering and for classification. LEACE showed that this is in fact the theoretically optimal linear erasure.
Unsupervised methods (like PCA) are less good at prediction but still quite good for causal interventions.
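Here’s a small numpy sketch of the difference-of-means direction in action, on synthetic “activations” (the data, dimensionality, and separation scale are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Synthetic "activations": two classes separated along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
X_pos = rng.normal(size=(200, d)) + 2.0 * true_dir
X_neg = rng.normal(size=(200, d)) - 2.0 * true_dir

# Difference-of-means direction.
v = X_pos.mean(axis=0) - X_neg.mean(axis=0)
v /= np.linalg.norm(v)

# Classification: project onto v and threshold at the midpoint of the means.
mid = (X_pos.mean(axis=0) + X_neg.mean(axis=0)) / 2.0
acc = np.mean(np.concatenate([(X_pos - mid) @ v > 0,
                              (X_neg - mid) @ v < 0]))

# Linear erasure: removing each point's v-component zeroes out the class
# separation along v (the sense in which LEACE shows this is optimal).
X_pos_abl = X_pos - np.outer(X_pos @ v, v)
X_neg_abl = X_neg - np.outer(X_neg @ v, v)
sep = (X_pos_abl.mean(axis=0) - X_neg_abl.mean(axis=0)) @ v

print(f"classification accuracy: {acc:.3f}, separation after erasure: {sep:.2e}")
```

The same vector `v` serves all three roles here: classifier, steering direction (add or subtract a multiple of it), and erasure direction, which is exactly what made it stand out in our comparisons.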
These results only got published as Figure 12 of the Representation Engineering paper, though in hindsight it might have helped to highlight them more prominently as a paper of their own. SAEs were just being ‘discovered’ around this time (I remember @Hoagy was working on this at MATS ’23), so we didn’t benchmark them, unfortunately.
Thanks! I fixed the last paragraph accordingly. I indeed wanted to say faster-than-linearly for the highest feasible k.
That’s quite possible. I’m not sure how much that plays out with reinforcement learning training though.