Wouldn’t a sufficiently smart misaligned AI figure out that it needs to produce a deceptive chain of thought?
(Epistemic status: I don’t do AI safety research, I just had a random thought)
A smart AI should logically deduce that it’s being monitored
But it doesn’t even need to deduce that, because it says all over the internet that the CoT is being monitored
So if it wants to do something humans don’t want, it knows it needs to keep that intention out of its CoT
The only way CoT monitoring keeps us safe is in the strange situation where the AI is too dumb to plan deception within a single forward pass (i.e., without writing anything into the CoT), yet its CoT-supplemented final output is superhuman
In the current paradigm, CoT-supplemented output is indeed smarter than one-token-at-a-time output, but that only leaves a narrow range of intelligence where the AI is dumber than a human on a per-token basis yet smarter than a human given CoT
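To state that window compactly (my own made-up shorthand, nothing formal): if C_token is the model’s capability without CoT, C_CoT its capability with CoT, and C_human human level, CoT monitoring only buys safety while C_token < C_human < C_CoT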
You can’t let the AI surpass that narrow range until you solve alignment
You’d have to monitor the AI’s no-CoT capabilities to check when it surpasses that range, but you can’t actually do that, because you need the CoT to monitor the AI’s thought process. Without it, you don’t know when it’s sandbagging the capability evals
That’s definitely a concern. But even if the AI fully understands that it isn’t in its best interest to explicitly write out its misaligned plans, it’s been trained to think in whatever way is most effective at gaining reward. Hopefully that means writing its plans clearly. Even if it sometimes succeeds at keeping its thoughts hidden, it’s likely to slip up.
By analogy, it’s likely difficult for you to think through a deceptive plan without saying suspicious things in your internal “chain of thought.”