Wouldn’t a sufficiently smart misaligned AI figure out that it needs to produce a deceptive chain of thought?
(Epistemic status: I don’t do AI safety research, I just had a random thought)
A smart AI should logically deduce that it’s being monitored
But it doesn’t even need to deduce that, because it says all over the internet that the CoT is being monitored
So if it wants to do something humans don’t want, it knows it needs to keep that intention out of its CoT
The only way CoT monitoring keeps us safe is in the strange situation where the AI is too dumb to plan deception within a single forward pass (i.e., without writing anything into the CoT), yet its CoT-supplemented final output is superhuman
In the current paradigm, CoT-supplemented output is indeed smarter than one-token-at-a-time output, but that only leaves a narrow range of intelligence where the AI is dumber than a human on a per-token basis yet smarter than a human given CoT
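To state that window compactly (my own made-up shorthand, nothing formal): if C_token is the model’s capability without CoT, C_CoT its capability with CoT, and C_human human level, CoT monitoring only buys safety while C_token < C_human < C_CoT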
You can’t let the AI surpass that narrow range until you solve alignment
You’d have to monitor the AI’s no-CoT capabilities to check when it surpasses that range, but you can’t actually do that, because you need the CoT to monitor the AI’s thought process. Without it, you don’t know when it’s sandbagging the capability evals
That’s definitely a concern. But even if the AI fully understands that it isn’t in its best interest to explicitly write out its misaligned plans, it’s been trained to think in whatever way is most effective at gaining reward. Hopefully that means writing its plans clearly. Even if it sometimes succeeds at keeping its thoughts hidden, it’s likely to slip up.
By analogy, it’s likely difficult for you to think through a deceptive plan without saying suspicious things in your internal “chain of thought.”