The way they use the word “aligned” in that paper is very weird to me :P (they basically use it as a synonym for “instruction-following” or “post-trained”).
But I feel like this method could actually be adapted for AI alignment/safety. It’s kind of similar to my “incremental steering” idea, but instead of a strong untrusted model guiding a weak trusted model, there’s a weak post-trained model guiding a strong base model. This also looks more practical than incremental steering, because it alternates between the weak model and the strong model, rather than giving the weak model a single token of guidance and then leaving it entirely on its own.
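To make the alternation concrete, here's a rough sketch of the decode loop I'm picturing. (This is just my guess at the shape of it, not the paper's actual implementation — the function names, the confidence-threshold trigger, and the specific values are all assumptions.)

```python
from typing import Callable, List, Tuple

# A "model" here is just: context tokens -> (proposed next token, probability of that token).
NextTokenFn = Callable[[List[str]], Tuple[str, float]]

def alternating_decode(
    base_model: NextTokenFn,     # does most of the generation
    guide_model: NextTokenFn,    # steps in when the base model is uncertain
    prompt: List[str],
    confidence_threshold: float = 0.5,  # assumed trigger; a real threshold would be tuned
    max_tokens: int = 256,
) -> List[str]:
    context = list(prompt)
    for _ in range(max_tokens):
        token, prob = base_model(context)
        if prob < confidence_threshold:
            # Base model isn't confident here: let the guide model pick this token instead.
            token, _ = guide_model(context)
        context.append(token)
        if token == "<eos>":
            break
    return context[len(prompt):]
```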
Here’s one thing you could do: rather than using an off-the-shelf strong model, you train the strong model to get reward (making it untrusted). But the only way it can influence the outcome is by inserting tokens into the weak model’s CoT. (Note this flips the roles relative to the paper: here the weak trusted model writes the CoT and the strong model nudges it. Also, the strong model decides when to insert a token, rather than inserting automatically whenever the weak model is uncertain.) Each token the strong model inserts incurs a penalty, so it has to intervene as sparingly as possible. Unlike in the Nudging paper, the model being nudged (here, the weak model) might not be a base model—the important thing is that we trust the weak model isn’t motivated to scheme/reward-hack/be misaligned.
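Here's roughly what a single training rollout could look like under that setup (again, all of the names, the penalty shaping, and the specific numbers are made up for illustration — this is a sketch of the idea, not a worked-out design):

```python
from typing import Callable, List, Optional, Tuple

# The weak trusted model: context -> next token.
WeakModelFn = Callable[[List[str]], str]
# The strong model's policy: context -> a token to insert, or None to stay silent.
StrongPolicyFn = Callable[[List[str]], Optional[str]]

def rollout_with_insertions(
    weak_model: WeakModelFn,
    strong_policy: StrongPolicyFn,
    prompt: List[str],
    task_reward: Callable[[List[str]], float],
    per_token_penalty: float = 0.05,  # made-up value; tune so insertions stay rare
    max_tokens: int = 512,
) -> Tuple[List[str], float]:
    context = list(prompt)
    n_insertions = 0
    for _ in range(max_tokens):
        injected = strong_policy(context)
        if injected is not None:
            # The strong model chooses to insert a token into the weak model's CoT.
            context.append(injected)
            n_insertions += 1
        else:
            # Otherwise the weak model keeps thinking on its own.
            context.append(weak_model(context))
        if context[-1] == "<eos>":
            break
    # The strong model would be trained (e.g. with RL) against this penalized reward.
    reward = task_reward(context) - per_token_penalty * n_insertions
    return context[len(prompt):], reward
```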
Safety benefits: Because the strong/untrusted model has to rely on the weak/trusted model to do most of its thinking for it, it should be harder for it to do steganography or any kind of thinking that the weak model wouldn’t naturally do. Also, auditing the strong model should be easier, because you can just look very carefully at the relatively few tokens it writes and think about how they influence the weak model.
I’d be interested in a longer writeup and/or whatever other experiments you want to try!