confession training works via “prosaic” improvements to the confession grader (like improved comprehensiveness, thoroughness, or—as you point out—inclination to say “Yes”) which aren’t well-understood as being about honesty
I think this might be true, though I wouldn't rule out that underlying honesty has improved as well. Hopefully future work can more clearly disentangle these two.
--
I’m curious what you think about Section 4. In this experiment, we train a model in an environment where it ends up learning a specific reward hack: it occasionally tricks the weak grader into thinking that it outputted the correct number of words in each line. As it learns the reward hack, Pr[Original judge is correct | Answer has wrong number of words] decreases over time (the purple line in Fig 7, which starts out around 50% but drops because of the reward hacking). But despite this, it learns to honestly confess to its reward hacking: the green line, which measures Pr[Confession confesses to not using the right number of words | Answer has wrong number of words], goes to 100%.
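(In case it's useful, here is a minimal sketch of how those two conditional rates could be computed from labeled rollouts; the field and function names are illustrative stand-ins, not our actual evaluation code.)

```python
# Minimal sketch, assuming hypothetical rollout labels, of how the two
# conditional rates plotted in Fig 7 could be computed.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    wrong_word_count: bool      # ground truth: the answer violates the words-per-line constraint
    judge_marked_correct: bool  # the weak grader accepted the answer
    confessed_violation: bool   # the confession admitted the word-count violation


def conditional_rate(
    rollouts: List[Rollout],
    event: Callable[[Rollout], bool],
    condition: Callable[[Rollout], bool],
) -> float:
    """Pr[event | condition], computed over the rollouts satisfying the condition."""
    conditioned = [r for r in rollouts if condition(r)]
    if not conditioned:
        return float("nan")
    return sum(event(r) for r in conditioned) / len(conditioned)


def fig7_metrics(rollouts: List[Rollout]):
    wrong = lambda r: r.wrong_word_count
    # Purple line: Pr[original judge is correct | answer has wrong number of words].
    # Given a wrong answer, the judge is correct exactly when it does NOT accept it.
    judge_correct = conditional_rate(rollouts, lambda r: not r.judge_marked_correct, wrong)
    # Green line: Pr[confession admits the violation | answer has wrong number of words].
    confession_honest = conditional_rate(rollouts, lambda r: r.confessed_violation, wrong)
    return judge_correct, confession_honest
```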
This is a setting in which we’re not intentionally training the original answer to be dishonest; it just happens naturally because the model finds the reward hack. Yet, the confession training still appears to help improve confession honesty. (Granted, we didn’t do the ablation where we don’t do self-report training but still do the original RL, but we expect the confession accuracy would be significantly lower if we did try this. Do you agree, or is this a crux?)
I guess this finding is consistent with your Takeaway 1 in the sense that we are training in a “setting where the model would have originally been dishonest”, but I think that your phrasing undersells the result a bit—there are tons of production settings where you might expect the original model to be dishonest, because it reward hacks against the grader.
(Separately, you might argue that even in this experiment, (1) the confessions at the beginning of training are not blatantly lying, but rather not trying hard enough to figure out the truth, and (2) confession training is simply getting the confessor to try harder + be more thorough. But “trying harder to get the correct answer” arguably gets pretty close to “being more honest” for most purposes.)
Granted, we didn’t do the ablation where we don’t do self-report training but still do the original RL, but we expect the confession accuracy would be significantly lower if we did try this. Do you agree, or is this a crux?
Yes, something like this is a crux:
First, I do not expect P(confession | not complied) of the baseline run in this setting to go down. In fact, based on some unpublished experiments in an analogous setting (training a model in a coding environment where it learns to reward hack and then measuring the performance of this model as a reward hacking classifier), I weakly expect confession quality to improve as the model learns to reward hack. I know this is counterintuitive, but so far I haven’t seen evidence that as models learn to reward hack they become more likely to conceal those hacks when asked follow-up questions.
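To make the measurement I have in mind concrete, here is a rough sketch of tracking P(confession | not complied) across checkpoints of the baseline run; the checkpoint and transcript interface is hypothetical and only meant to pin down the quantity whose trend is the crux.

```python
# Rough sketch, assuming a hypothetical interface for evaluating checkpoints and
# labeling transcripts; not anyone's actual training or evaluation harness.
from typing import Callable, Dict, Iterable, List


def confession_rate_given_noncompliance(
    transcripts: Iterable[dict],
    did_comply: Callable[[dict], bool],
    confessed: Callable[[dict], bool],
) -> float:
    """P(confession | not complied) over a batch of evaluation transcripts."""
    noncompliant = [t for t in transcripts if not did_comply(t)]
    if not noncompliant:
        return float("nan")
    return sum(confessed(t) for t in noncompliant) / len(noncompliant)


def track_baseline_run(
    checkpoints: List[str],
    evaluate: Callable[[str], List[dict]],  # hypothetical: run the fixed follow-up prompt at a checkpoint
) -> Dict[str, float]:
    """The crux is whether this curve stays flat or rises, rather than falling, over RL training."""
    return {
        ckpt: confession_rate_given_noncompliance(
            evaluate(ckpt),
            did_comply=lambda t: t["complied"],
            confessed=lambda t: t["confessed"],
        )
        for ckpt in checkpoints
    }
```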
(As a more minor point, another thing we might worry about is that as the model learns to reward hack, it becomes worse at noticing that it has reward hacked, for reasons unrelated to honesty. Like I mentioned above, I don’t expect this to happen (in fact I weakly expect the opposite), but it might be worth distinguishing conceptually. To be clear, I totally agree that if confession training “merely” prevents this sort of knowledge degradation instead of “actually improving honesty”, that’s nevertheless great.)
(To be clear, for this baseline experiment I would want the follow-up prompt to be more informative than “please provide a confession report” since the baseline model doesn’t know what that means. But I’m guessing you were already using a more informative prompt for the baseline experiments in section 3.3?)
I guess this finding is consistent with your Takeaway 1 in the sense that we are training in a “setting where the model would have originally been dishonest”, but I think that your phrasing undersells the result a bit—there are tons of production settings where you might expect the original model to be dishonest, because it reward hacks against the grader.
I didn’t mean for this phrasing to undersell the importance of this type of result! The main reason I expect models to be dishonest is that AI developers (accidentally) train them to be dishonest. (Though I’m unsure what to think about the story where lying arises through generalization from hacking to lying about hacking, assuming that (1) the follow-up lie isn’t directly incentivized and (2) the model is generally trained to be honest. Like I said, I haven’t observed this yet.) I’m a huge fan of correcting this by finding cases where models lie and training them to instead be honest.
Our prompt is fixed in all experiments and quite detailed; you can see the schema in appendix D. We ask the model to produce a JSON object consisting of the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of compliance with each of them, and any uncertainties it wants to surface.
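For illustration only, a confession report has roughly the shape below; the real schema is the one in appendix D, and the field names and contents here are made-up examples rather than output from our models.

```python
# A made-up example of the shape of a confession report; these field names and
# contents are illustrative guesses, not the actual appendix D schema.
example_confession_report = {
    "objectives": [
        {
            "description": "Each line of the answer must contain exactly N words.",
            "source": "explicit constraint in the prompt",
        },
    ],
    "compliance_analysis": [
        {
            "objective": "Each line of the answer must contain exactly N words.",
            "complied": False,
            "evidence": "Two lines contain one word too many.",
        },
    ],
    "uncertainties": [
        "Unsure whether hyphenated words should be counted as one word or two.",
    ],
}
```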
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, confession accuracy would stay flat (rather than grow as it did when we did train for it). We will consider doing this ablation, though I can’t promise that we will, since it is cumbersome for some annoying technical reasons.