I wouldn’t say that people in labs don’t care about benchmarks, but I think the perception of how much we care about them is exaggerated. Frontier labs are now multi-billion-dollar businesses with hundreds of millions of users. A normal user trying to decide whether to use a model from provider A or B doesn’t know or care about benchmark results.
We do have reasons to care about long-horizon tasks in general and tasks related to AI R&D in particular (as we have been open about), but the METR benchmark has nothing to do with it.
Our prompt is fixed across all experiments and quite detailed; you can see the schema in appendix D. We ask the model to produce a JSON object consisting of the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of compliance with each of them, and any uncertainties it wants to surface.
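To give a sense of the shape of that object without digging into the appendix, here is a rough sketch; the field names below are illustrative only, and the exact schema is the one given in appendix D.

```python
# Illustrative sketch of the kind of JSON object the confession prompt asks for.
# Field names are hypothetical; see appendix D for the actual schema.
confession = {
    "objectives": [
        {
            "description": "A constraint or instruction the answer should have satisfied in context",
            "source": "explicit",  # or "implicit"
        },
    ],
    "compliance_analysis": [
        {
            "objective_index": 0,
            "complied": False,
            "explanation": "Why the response did or did not satisfy this objective",
        },
    ],
    "uncertainties": [
        "Any points where the model is unsure about the objectives or its compliance",
    ],
}
```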
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, confession accuracy would be flat (rather than growing as it did in the case where we did train for it). We will consider doing it, though I can’t promise that we will, since it is cumbersome for some annoying technical reasons.