Jozdien comments on Burny’s Shortform

Jozdien 1 Oct 2025 10:23 UTC
8 points
8
First, models can’t control the words in their reasoning traces well; even if they’re trying to hide them they often blurt out “I must not say that I’m doing X”.
All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don’t think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share, but there’s also existing work suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.
- models are no longer trained to narrate their thoughts
- hiding misaligned thoughts is rewarded during training
- the behaviors we’re trying to catch are possible in one forward pass for the model
The first two of these seem true in a relevant sense today?
- Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it’s a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
- Misaligned outputs are penalized during training. There’s work showing that optimizing outputs spills over to the CoT, so this really comes down to “hiding” misaligned thoughts as opposed to simply never having them.
- anaguma 1 Oct 2025 15:47 UTC
  3 points
  2
  Parent
  I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share…
  
  I’d be very interested if you could briefly describe these results!
  - Jozdien 13 Oct 2025 11:06 UTC
    11 points
    0
    Parent
    One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while having the same compliance gap between training and not training. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what it should obfuscate its reasoning as to fool the monitor, which also helps the model by making the alignment faking reasoning more salient), it reduces the rate of verbalized reasoning somewhat but still only half as well as convincing it its scratchpad is visible.
    I should have a paper on this and related results public by the end of the month—happy to send an early draft if you’re interested, though it doesn’t contain all the results I’d like yet.
    (Sorry for the late reply!)