# LawrenceC

Karma: 216
• Hyperbolic discounting leads to preference reversals over time: the classic example is always preferring a certain $1 now to $2 tomorrow, but preferring a certain $2 in a week to $1 in 6 days. This is a pretty clear sign that it never "should" be done: an agent with these preferences might find themselves paying a cent to switch from $1 in 6 days to $2 in 7, then, 6 days later, paying another cent to switch it back and get the $1 immediately. However, in practice, even rational agents might exhibit hyperbolic-discounting-like preferences (though no preference reversals): for example, right now I might not believe you're very trustworthy and worry you might forget to give me money tomorrow, so I prefer $1 now to $2 tomorrow. But if you actually are going to give me $1 in 6 days, I might update towards thinking you're quite trustworthy and then be willing to wait another day to get $2 instead. (See this paper for a more thorough discussion of this possibility: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1689473/pdf/T9KA20YDP8PB1QP4_265_2015.pdf)
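To make the reversal concrete, here's a minimal numeric sketch using the standard hyperbolic discount function V(A, t) = A / (1 + kt); the discount rate k = 1.5 per day is a hypothetical value chosen just to make the numbers work out:

```python
def value(amount, delay_days, k=1.5):
    """Hyperbolically discounted value: amount / (1 + k * delay)."""
    return amount / (1 + k * delay_days)

# Viewed from today, $1 now beats $2 tomorrow:
print(value(1, 0), value(2, 1))   # 1.0 vs 0.8   -> take the $1 now
# But $2 on day 7 beats $1 on day 6:
print(value(1, 6), value(2, 7))   # 0.1 vs ~0.17 -> wait the extra day
```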

• I believe your definition of accuracy differs from the ISO definition (which is the usage I learned in undergrad statistics classes, and also the usage most online sources seem to agree with): a measurement is accurate insofar as it is close to the true value. By this definition, the second graph is accurate but not precise because all the points are close to the true value. I'll be using that definition in the remainder of my post. That being said, Wikipedia does claim your usage is the more common one.

I don't have a clear sense of how to answer your question empirically, so I'll give a theoretical answer.

Suppose our goal is to predict some value $\theta$. Let $\hat{\theta}$ be our predictor for $\theta$ (for example, we could ask a subject to predict $\theta$). A natural way to measure accuracy for prediction tasks is the mean squared error $\mathbb{E}[(\hat{\theta} - \theta)^2]$, where a lower mean squared error means higher accuracy. The bias-variance decomposition of mean squared error gives us:

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$$
The first term on the right is the (squared) bias of your estimator: how far the expected value of your estimator is from the true value. An unbiased estimator is one that, in expectation, gives you the right value (what you mean by "accuracy" in your post, and what ISO calls "trueness"). The second term is the variance of your estimator: the expected squared deviation of your estimator from its average value. Rephrasing a bit, this measures how imprecise your estimator is, on average.

As both of the terms on the right are always non-negative, the bias and variance of your estimator both lower-bound your mean squared error.

However, it turns out that there's often a trade-off between having an unbiased estimator and having a more precise estimator, known appropriately as the bias-variance trade-off. In fact, there are many classic examples in statistics of estimators that are biased but have lower MSE than any unbiased estimator. (Here's the first one I found while Googling.)
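As a concrete illustration (a minimal simulation; the true variance and sample size below are arbitrary choices of mine): for normally distributed data, the maximum-likelihood variance estimator, which divides by n, is biased downward, yet has lower MSE than the unbiased estimator, which divides by n-1:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n = 4.0, 5
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

for ddof, label in [(1, "unbiased (divide by n-1)"),
                    (0, "biased MLE (divide by n)")]:
    est = samples.var(axis=1, ddof=ddof)     # one variance estimate per row
    bias = est.mean() - true_var             # estimated bias
    mse = ((est - true_var) ** 2).mean()     # estimated MSE
    print(f"{label}: bias = {bias:+.2f}, MSE = {mse:.2f}")
```

With these numbers, the biased estimator's MSE is roughly 5.8 against 8 for the unbiased one (and for normal data, dividing by n+1 does better still).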

• For what it's worth, though, as far as I can tell we don't have the ability to create an AI that will reliably maximize the number of paperclips in the real world, even with infinite computing power. As Manfred said, model-based goals seem to be a promising research direction for getting AIs to care about the real world, but we don't currently have the ability to get such an AI to reliably "value paperclips". There are a lot of problems with model-based goals that occur even in the POMDP setting, let alone when the agent's model of the world or observation space can change. So I wouldn't expect anyone to be able to propose a fully coherent, complete answer to your question in the near term.

It might be useful to think about how humans "solve" this problem, and whether or not you can port this behavior over to an AI.

If you're interested in this topic, I would recommend MIRI's paper on value learning as well as the relevant Arbital Technical Tutorial.

• The reason for this is the 5% chance of mistakes. Copycat does worse against both Simpleton and Copycat than Simpleton does against itself.
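If you want to check this yourself, here's a rough simulation sketch, assuming the game's payoff convention (cooperating costs you 1 coin and gives the other player 3) and a 5% chance that each move gets flipped by mistake:

```python
import random

# Payoffs: cooperating costs 1, the other player receives 3.
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (-1, 3),
          ('D', 'C'): (3, -1), ('D', 'D'): (0, 0)}

def copycat(my_last, opp_last):        # tit-for-tat
    return opp_last

def simpleton(my_last, opp_last):      # win-stay, lose-shift
    return 'C' if my_last == opp_last else 'D'

def play(strat_a, strat_b, rounds=100_000, noise=0.05, seed=0):
    rng = random.Random(seed)
    a_last = b_last = 'C'
    total_a = total_b = 0
    for _ in range(rounds):
        a = strat_a(a_last, b_last)
        b = strat_b(b_last, a_last)
        if rng.random() < noise:       # each move misfires 5% of the time
            a = 'D' if a == 'C' else 'C'
        if rng.random() < noise:
            b = 'D' if b == 'C' else 'C'
        pa, pb = PAYOFF[(a, b)]
        total_a, total_b = total_a + pa, total_b + pb
        a_last, b_last = a, b
    return total_a / rounds, total_b / rounds

print("Copycat vs Copycat:    ", play(copycat, copycat))
print("Copycat vs Simpleton:  ", play(copycat, simpleton))
print("Simpleton vs Simpleton:", play(simpleton, simpleton))
```

The intuition: after a single mistake, two Copycats fall into a long echo of alternating retaliation, while two Simpletons re-establish mutual cooperation within two rounds.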

• I'm really confused by this.

• I think the term "Dark Arts" is used by many in the community to refer to generic, truth-agnostic ways of getting people to change their mind. I agree that Scott Adams demonstrates mastery of persuasion techniques, and that this is indeed not necessarily evidence that he is not a "rationalist".

However, the specific claim made by James_Miller is that it is a "model rationalist disagreement". I think that since Adams used the persuasion techniques that Stabilizer mentioned above, it's pretty clear that it isn't a model rationalist disagreement.

• Awesome! I heard a rumor that David Krueger (one of Bengio's grad students) is one of the main people pushing the safety initiative there. Can anyone confirm?

• Thanks for the review! I definitely had the sense that Rosen was doing a lot of hand-holding and handwaving; it's certainly a very introductory text. I've read both Rosen and Eppstein and actually found Rosen better. The discrete math class I took in college used Scheinerman's Mathematics: A Discrete Introduction, which I also found to be worse than Rosen.

At the time I actually really enjoyed the fact that Rosen went on tangents and helped me learn how to write a proof, since I was relatively lacking in mathematical maturity. I'd add that Rosen does cover proof writing earlier in the book, but I suspect that MCS might do this job better. Given the target audience of the MIRI research guide, I think it makes sense to switch over to MCS from Rosen.

• Thanks Søren! Could I ask what you're planning on covering in the future? Is this mainly going to be a technical or non-technical reading group?

I noticed that your group seems to have covered a lot of the basic readings on AI safety, but I'm curious what your future plans are.

• I haven't heard much about machine learning being used for forecast aggregation. It would seem to me like many, many factors could be useful in aggregating forecasts. For instance, some elements of one's social media profile may be indicative of their forecasting ability. Perhaps information about the educational differences between multiple individuals could provide insight into how correlated their knowledge is.

I think people are looking into it: the Good Judgment Project team used simple machine learning algorithms as part of their submission to IARPA during the ACE tournament. One of the PhD students involved in the project wrote his dissertation on a framework for aggregating probability judgments. In the Good Judgment team at least, people are also interested in using ML for other aspects of prediction (for example, predicting whether a given comment will change another person's forecasts), but I don't think there's been much success.

I think a real problem is the paucity of data for ML-based prediction aggregation compared to most machine learning projects: a good prediction tournament gets a couple hundred forecasts resolving in a year, at most.
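For concreteness, here's a minimal sketch of the kind of simple aggregation approach the GJP work describes: average the individual forecasts in log-odds space, then extremize the pooled result. The extremizing exponent alpha = 2.5 below is a hypothetical value; the right amount of extremizing depends on how much the forecasters' information overlaps.

```python
import numpy as np

def aggregate(probs, alpha=2.5):
    """Pool probability forecasts: mean in log-odds space, then extremize.

    alpha > 1 pushes the pooled forecast away from 50-50, correcting for
    the underconfidence of simple averages of partially independent
    forecasters."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    log_odds = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-alpha * log_odds.mean()))

print(aggregate([0.6, 0.7, 0.65]))   # ~0.83, more extreme than the raw mean
```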

> Probability density inputs would also require additional understanding from users. While this could definitely be a challenge, many prediction markets already are quite complicated, and existing users of these tools are quite sophisticated.

I think this is a bigger hurdle than you'd expect if you're implementing these for prediction tournaments, though it might be possible to do for prediction markets. (However, I'm curious how you're going to implement the market mechanism in this case.) Anecdotally speaking, many of the people involved in GJ Open are not particularly math or tech savvy, even amongst the people who are good at prediction.

• Fair point.

• > I'm just saying that you have an infinite sequence of spheres with the property X. You're saying that because the sequence is infinite I can't point to the last sphere and therefore can't say anything about it. I'm saying that because all spheres in this sequence have the property X, it doesn't matter that the sequence is infinite.

This isn't true in general. Each natural number is finite, but the limit of the natural numbers is infinite. Just because each of the intermediate shapes has property X doesn't mean the limiting shape has property X. Notably, in this case each of the intermediate shapes has a non-zero amount of empty space, but the limiting shape has no empty space.

• Maybe think about the problem this way:

Suppose there was some small ball inside of your super-packed structure that isn't filled. Then we can fill this ball, and so the structure isn't super-packed. It follows that the volume of the empty space inside of your structure has to be 0.

Now, what does your super-packed structure look like, given that it's an empty cube that's been filled?

EDIT: Never mind, just saw that Villiam gave a similar answer.

• I think they're equivalent in a sense, but that bucket diagrams are still useful. A bucket can also occur when you conflate multiple causal nodes. So in the first example, the kid might not even have a conscious idea that there are three distinct causal nodes ("spelled oshun wrong", "I can't write", "I can't be a writer"), but instead treats them as a single node. If you're able to catch the flinch, introspect, and notice that there are actually three nodes, you're already a big part of the way there.

• Thanks for posting this! I have a longer reply to Taleb's post that I'll post soon. But first:

> When you read Silver (or your preferred reputable election forecaster, I like Andrew Gelman) post their forecasts prior to the election, do you accept them as equal or better than any estimate you could come up with? Or do you do a mental adjustment or discounting based on some factor you think they've left out?

I think it depends on the model. First, note that all forecasting models only take into account a specific set of signals. If there are factors influencing the vote that you're aware of and don't think are reflected in the signals, then you should update their forecast to reflect this. For example, because Nate Silver's model was based on polls, which lag behind current events, if you had some evidence that a given event was really bad or really good for one of the two candidates (such as the Comey letter or the Trump video), you should have updated in favor of or against a Trump presidency before the event became reflected in the polls.

> The math is based on assumptions though that with high uncertainty, far out from the election, the best forecast is 50-50.

Not really. The key assumption is that your forecast is a Wiener process: a continuous-time martingale with normally-distributed increments. (I find this funny because Taleb spends multiple books railing against normality assumptions.) This is kind of a troubling assumption, as Lumifer points out below. If your forecast is continuous (though it need not be), then it can be thought of as a time-transformed Wiener process, but as far as I can tell he doesn't account for the time transformation.

Everyone agrees that as uncertainty becomes really high, the best forecast is 50-50. Conversely, if you make a confident forecast (say 90-10) and you're properly calibrated, you're also implying that you're unlikely to change your forecast by very much in the future (with high probability, you won't forecast 1-99).

I think the question to ask is: how much volatility should make you doubt a forecast? If someone's forecast varied daily between 1-99 and 99-1, you might learn to just ignore them, for example. Taleb tries to offer one answer to this, but makes some questionable assumptions along the way, and I don't really agree with his result.
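One way to see the calibration point above is with a quick simulation. The model below is my own sketch, not Taleb's: the underlying signal is a Wiener process X_t, the outcome is whether X_T > 0, and the calibrated forecast is p_t = Φ(X_t/√(T−t)), which is a martingale. Starting from a 90% forecast, only a small fraction of paths ever swing below 1% (optional stopping bounds this fraction by 0.1/0.99 ≈ 10%):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T, n_steps, n_paths = 1.0, 1000, 10_000
dt = T / n_steps

# Start every forecast at 90%: pick X_0 so that P(X_T > 0 | X_0) = 0.9.
x = np.full(n_paths, norm.ppf(0.9) * np.sqrt(T))
ever_below_1pct = np.zeros(n_paths, dtype=bool)

for i in range(n_steps - 1):
    x += rng.normal(0.0, np.sqrt(dt), n_paths)   # evolve the signal
    t = (i + 1) * dt
    p = norm.cdf(x / np.sqrt(T - t))             # calibrated P(X_T > 0 | X_t)
    ever_below_1pct |= (p < 0.01)

print(f"fraction of paths ever forecasting below 1%: {ever_below_1pct.mean():.3f}")
```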

• > We have the election estimate F a function of a state variable W, a Wiener process WLOG

> That doesn't look like a reasonable starting point to me.

That's fine, actually: if you assume your forecasts are continuous in time, then they're continuous martingales and thus equivalent to some time-changed Wiener process. (EDIT: your forecasts need not be continuous, my bad.) The problem is that he doesn't take into account the time transformation when he claims that you need to weight your signal by 1/sqrt(t).

He also has a typo in his statement of Ito's Lemma, which might affect his derivation. I'll check his math later.
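For reference, the usual statement of Ito's Lemma for a function F(W_t, t) of a standard Wiener process (the form relevant to his setup) is:

$$dF = \left(\frac{\partial F}{\partial t} + \frac{1}{2}\frac{\partial^2 F}{\partial W^2}\right)dt + \frac{\partial F}{\partial W}\,dW_t$$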