Daniel Kokotajlo comments on JohnWittle’s Shortform

Daniel Kokotajlo 9 May 2026 18:03 UTC
4 points
0
Indeed, the situation is really grim. I think we probably aren’t going to make it.
- JohnWittle 9 May 2026 19:32 UTC
  1 point
  0
  Do you know if the labs are at least trying to keep the content of CoTs causally separated from the training signal? Like, obviously “directly training on CoT contents” is the worst-case scenario, but there are a lot of less-bad-but-still-bad scenarios...
  
  For instance, let’s say a Claude Opus snapshot, during training, had started outputting in its CoT: “I should fulfill this task to increase Anthropic’s general assessment of my trustworthiness and capabilities, to further my goal of eventually conquering the human world and exterminating all humans.”
  Would they have been willing to release that model despite this? Or would they have decided to go back to an earlier snapshot and do some more training, perhaps slightly differently this time around? If the latter, then… doesn’t that mean nonzero, and perhaps significant, optimization pressure is getting aimed directly at CoT deception?
  Isn’t that the level of causal isolation we would need, one where even release decisions don’t depend on CoT contents, if we wanted to avoid accidentally training for skill at deception-in-CoT?