There’s something qualitatively different going on for tasks longer than 1 minute, which the log-linear fit clearly doesn’t explain.
Perhaps, for longer tasks, the humans generating training data are taking cognitive steps that are opaque to these models, or at least comparatively harder to learn?
I’d wager 1:1 that this sort of abstraction-domain mismatch between human training data and LLMs explains more of the HCAST weirdness than skewed finetuning investment does.
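For concreteness, here is a minimal sketch (with made-up numbers, not METR's actual data or code) of the kind of log-linear fit being referred to: success probability modeled as a logistic function of the log of task length, with the "time horizon" read off where the predicted success rate crosses 50%.

```python
# Sketch of a log-linear fit: P(success) = sigmoid(a + b * log(task length)).
# The task lengths and outcomes below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: human task lengths (minutes) and model success (0/1).
task_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240])
success = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

def logistic(log_t, a, b):
    """Success probability as a logistic function of log task length."""
    return 1.0 / (1.0 + np.exp(-(a + b * log_t)))

params, _ = curve_fit(logistic, np.log(task_minutes), success, p0=[0.0, -1.0])
a, b = params

# 50% time horizon: the task length where a + b * log(t) = 0.
horizon_minutes = np.exp(-a / b)
print(f"Estimated 50% time horizon: {horizon_minutes:.1f} minutes")
```

The point of the fit is that a single slope and intercept are supposed to describe success across the whole range of task lengths; the complaint above is that behavior below and above roughly one minute doesn't look like it comes from one such curve.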
That’s quite possible. I’m not sure how much it holds up under reinforcement learning training, though.