For what it’s worth, while I definitely understand the criticism of the absurdly high eval standards, I believe they’re likely necessary to prevent problems that currently afflict evals. The reason is that current evals don’t require long-context holding/long-term memory, nor anything like continuous learning in the weights, and I’ve been persuaded that a lot of the difference between benchmarks and real-life usefulness lies here.
This also makes under-elicitation a much less relevant problem, but at the same time I can’t understand why, under this hypothesis, Anthropic thought one of their models was dangerous but the other wasn’t, when it should have been either both or neither.
I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long-term memory, or at least that the tasks don’t hit the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (it takes an hour). But it’s also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window, it might be able to write the rest of the decoder from scratch. It might not even need to keep the example files in memory while it’s debugging its project against the test cases.
Whereas a task like fixing a bug in a large software project, even though it might take an engineer familiar with that project “an hour” to finish, requires stretching the limits of how much information a model can fit inside its context window, or recall abilities beyond what current systems seem capable of.
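To make the contrast concrete, here is a rough back-of-the-envelope sketch. The file counts, file sizes, chars-per-token ratio, and context-window size are all made-up illustrative assumptions, not numbers from the METR paper or Anthropic’s evals; the point is just the orders-of-magnitude gap between the two kinds of task.

```python
# Back-of-the-envelope comparison of how much text an agent would need to hold
# in context for the two kinds of task above. All numbers are illustrative assumptions.

def approx_tokens(num_files: int, avg_chars_per_file: int, chars_per_token: int = 4) -> int:
    """Crude token estimate: total characters divided by an assumed chars-per-token ratio."""
    return num_files * avg_chars_per_file // chars_per_token

# Self-contained decoder task: a handful of example files the model can read in full.
decoder_task = approx_tokens(num_files=10, avg_chars_per_file=8_000)    # ~20k tokens

# Bug fix in a large project: thousands of source files, far more than fits in context.
large_repo = approx_tokens(num_files=5_000, avg_chars_per_file=6_000)   # ~7.5M tokens

CONTEXT_WINDOW = 200_000  # assumed window size for a current frontier model

print(f"decoder task: ~{decoder_task:,} tokens (fits: {decoder_task < CONTEXT_WINDOW})")
print(f"large repo:   ~{large_repo:,} tokens (fits: {large_repo < CONTEXT_WINDOW})")
```

Under these assumptions the first task fits comfortably in a single context window while the second is off by more than an order of magnitude, even before counting the back-and-forth of debugging.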
@lc explains one half of this here:
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/?commentId=vFq87Ge27gashgwy9
And Dwarkesh explains the other half here:
https://www.dwarkesh.com/p/timelines-june-2025
I do think Anthropic’s lack of transparency about their reasoning for why models are or aren’t dangerous is a problem, but I currently think Anthropic’s thresholds are reasonably defensible if we accept the thesis that current evals are basically worthless as capability markers.