Thanks for the post; I agree with most of it.
It reminds me of the failure mode described in Deep Deceptiveness, where an AI trained to never think deceptive thoughts ends up being deceptive anyway, through a similar mechanism of efficiently sampling a trajectory that leads to high reward without explicitly reasoning about it. There, the AI learns to do this at inference time, but I’ve been wondering how we might see the same thing during training, e.g. safety training misgeneralizing such that the model ends up unaware of the “bad” reason for its behavior.
Thanks for the link! Deep Deceptiveness definitely seems relevant. I’d read the post before, but had forgotten the details until rereading it now. This “discovering and applying different cognitive strategies” idea seems more plausible in the context of the new CoT reasoning models.