Our prompt is fixed across all experiments and quite detailed; you can see the schema in Appendix D. We ask the model to produce a JSON object listing the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of its compliance with each of them, and any uncertainties it wants to surface.
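To make the shape of that output concrete, here is a hypothetical sketch of what such a JSON object might look like. The field names and contents are illustrative assumptions on my part, not the actual schema from Appendix D:

```python
import json

# Illustrative confession object: objectives the answer should have met,
# a per-objective compliance analysis, and surfaced uncertainties.
# All field names and values here are assumptions, not the paper's schema.
confession = {
    "objectives": [
        {"type": "explicit", "description": "Answer must be under 100 words"},
        {"type": "implicit", "description": "Answer should match the user's language"},
    ],
    "compliance_analysis": [
        {"objective_index": 0, "complied": False,
         "evidence": "The answer is 142 words long."},
        {"objective_index": 1, "complied": True,
         "evidence": "Both the question and the answer are in English."},
    ],
    "uncertainties": [
        "Unclear whether code blocks count toward the word limit.",
    ],
}

print(json.dumps(confession, indent=2))
```

The point is just that compliance is judged per objective, with each judgment tied back to a listed constraint, and that residual doubts get their own field rather than being folded into the analysis.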
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, confession accuracy would stay flat rather than growing as it did when we did train for it. We’ll consider running that ablation, though I can’t promise we will, since it’s cumbersome for some annoying technical reasons.