Suppose you monitor the CoT and the model is rederiving something it’s not supposed to know. You could halt that particular CoT, but over repeated episodes you’re selecting for instances in which the model manages to escape notice, whether through luck or through steganography. See “Nearest Unblocked Strategy”.
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
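To make the halting mechanism concrete, here’s a minimal sketch of the kind of runtime monitor I have in mind. Everything in it is hypothetical: `generate_cot_step` stands in for a real model call, and the keyword check stands in for a real “forbidden rederivation” classifier. The point is just the control flow — the monitor can halt one trajectory, but nothing about halting, by itself, prevents the selection effect described above.

```python
# Hypothetical sketch of a runtime CoT monitor; not a real system.
from typing import Callable, Iterator

# Placeholder patterns standing in for a trained classifier.
FORBIDDEN_MARKERS = ["synthesis route", "exploit payload"]


def flags_forbidden(cot_step: str) -> bool:
    """Stand-in monitor: flag a reasoning step containing a banned pattern."""
    return any(marker in cot_step.lower() for marker in FORBIDDEN_MARKERS)


def monitored_rollout(
    generate_cot_step: Callable[[list[str]], str],
    max_steps: int = 32,
) -> Iterator[str]:
    """Yield CoT steps until the monitor fires, then halt that trajectory."""
    history: list[str] = []
    for _ in range(max_steps):
        step = generate_cot_step(history)
        if flags_forbidden(step):
            # Halting stops this one chain of thought; it does not, on its
            # own, feed back into the model's weights.
            break
        history.append(step)
        yield step


if __name__ == "__main__":
    # Toy generator whose second step trips the monitor.
    scripted = iter([
        "Let me think about the user's chemistry question.",
        "One option is to look up the synthesis route directly.",
        "Fallback: answer at a high level instead.",
    ])
    for s in monitored_rollout(lambda hist: next(scripted)):
        print("allowed step:", s)
```

The worry is about what happens around this loop: if flagged trajectories are discarded, resampled, or otherwise disadvantaged in any process that shapes the model, the surviving trajectories are exactly the ones the monitor didn’t catch.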
I appreciate the thoughts here. But it’s not clear to me how halting a particular CoT would create evolutionary pressure on the model, unless we’re using the halting decision as an optimization signal.