Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
This seems sensible for some of the well-known, larger-scale incidents of misalignment, but what exactly counts as “hitting a bumper”? Currently there seems to be no distinction between shallow and deep examples of misaligned capability. A deep capability is a product of many subtle underlying capabilities, so repeatedly steering the top-level behaviour isn’t going to change the underlying ones. I think about this a lot from the perspective of deception. Many core deceptive capabilities are actually critical to being a “good” human (at least assuming we want these models to ever interact with morally grey situations, which I am guessing we do).