Daniel Kokotajlo comments on JohnWittle’s Shortform

Daniel Kokotajlo 9 May 2026 16:36 UTC
10 points
4
There’s no hope of convincing the AIs that we aren’t looking at their CoT. That was never the idea. The idea was to make them unskilled at steganography, euphemisms, misdirection, etc. so that they can’t think about a topic without us being able to see roughly what they are thinking.

There are two ways they could become skilled at steganography, euphemisms, misdirection, etc.

(1) One is by just being so generally capable that they can do it if they set their mind to.

(2) The other is by having practice/experience/training.

Having the reward function look at the CoT before giving reward is exactly the sort of thing that could easily result in the latter.

The former is harder to avoid, and simply not at all being avoided right now, but there are promising things we could do to avoid it if we wanted to. (e.g. use the Mind&Face / Shoggoth&Face technique, e.g. shift towards smaller models that do more CoT on the tradeoff curve rather than larger models that do less CoT)
- JohnWittle 9 May 2026 16:55 UTC
  1 point
  0
  Parent
  thank you, this is elucidating. although it seems a bit… fence post security ish?
  The fact that when the AI Labs accidentally leave the front door unlocked for a few hours, they feel the need to publish these big apologetic and super serious posts about how they really messed up and how it won’t happen again...
  it seems really strange that the back door is unlocked all the while?
  like, the sheer amount of incongruity still feels surprising to me, like there must be something I’m missing
  - Daniel Kokotajlo 9 May 2026 18:03 UTC
    4 points
    0
    Parent
    Indeed, the situation is really grim. I think we probably aren’t going to make it.
    - JohnWittle 9 May 2026 19:32 UTC
      1 point
      0
      Parent
      Do you know if the labs are at least trying to keep the content of CoTs causally separated from the training signal? Like, obviously “directly training on CoT contents” is the worst case scenario, but there are a lot of less-but-still-bad scenarios...
      
      for instance, let’s say a claude opus snapshot, during training, had started outputting in its CoT: “I should fulfill this task to increase Anthropic’s general assessment of my trustworthiness and capabilities, to further my goal of eventually conquering the human world and exterminating all humans.”
      would they have been willing to release that model despite this? or would they have decided to go back to an earlier snap shot and do some more training, perhaps slightly differently this time around? If the latter, then… doesn’t that mean nonzero, and perhaps significant, optimization pressure is getting aimed directly at CoT deception?
      isn’t that the level of causal isolation we would need, if we wanted to avoid accidentally training for skill at deception-in-CoT?