That’s definitely a concern. But even if the AI fully understands that explicitly writing out its misaligned plans isn’t in its best interest, it has been trained to think in whatever way is most effective at gaining reward, and hopefully that means writing its plans out clearly. Even if it sometimes succeeds at keeping its thoughts hidden, it’s likely to slip up.
By analogy, it’s likely difficult for you to think through a deceptive plan without saying suspicious things in your internal “chain of thought.”